AFRL-RY-WP-TP-2007-1223,  V2 


DIFFUSION  MAPS  AND  GEOMETRIC  HARMONICS  FOR 
AUTOMATIC  TARGET  RECOGNITION  (ATR) 

Volume  2:  Appendices 

Steven  W.  Zucker  and  Ronald  Coifman 
Yale  University 


NOVEMBER  2007 
Final  Report 


Approved  for  public  release;  distribution  unlimited. 

See  additional  restrictions  described  on  inside  pages 


STINFO  COPY 


AIR  FORCE  RESEARCH  LABORATORY 
SENSORS  DIRECTORATE 

WRIGHT-PATTERSON  AIR  FORCE  BASE,  OH  45433-7320 
AIR  FORCE  MATERIEL  COMMAND 
UNITED  STATES  AIR  FORCE 


NOTICE 


Using  Government  drawings,  specifications,  or  other  data  included  in  this  document  for 
any  purpose  other  than  Government  procurement  does  not  in  any  way  obligate  the  U.S. 
Government.  The  fact  that  the  Government  formulated  or  supplied  the  drawings, 
specifications,  or  other  data  does  not  license  the  holder  or  any  other  person  or  corporation; 
or  convey  any  rights  or  permission  to  manufacture,  use,  or  sell  any  patented  invention  that 
may  relate  to  them. 

This  report  was  cleared  for  public  release  by  the  Air  Force  Research  Laboratory  Public 
Affairs  Office  and  is  available  to  the  general  public,  including  foreign  nationals.  Copies  may 
be  obtained  from  the  Defense  Technical  Information  Center  (DTIC)  (http://www.dtic.mil). 


THIS  REPORT  HAS  BEEN  REVIEWED  AND  IS  APPROVED  FOR  PUBLICATION  IN 
ACCORDANCE  WITH  ASSIGNED  DISTRIBUTION  STATEMENT. 


*//signature// 


//signature// 


GREGORY  ARNOLD,  Ph.D. 
Project  Engineer 

ATR  &  Fusion  Algorithms  Branch 
Sensor  ATR  Technology  Division 


DEVERT  W.  WICKER,  Ph.D. 

Acting  Chief,  ATR  &  Fusion  Algorithms  Branch 
Sensor  ATR  Technology  Division 
Sensors  Directorate 


//signature// 


STEVEN P. WEBBER, LtCol, USAF
Deputy  Chief,  Sensor  ATR  Technology  Division 
Sensors  Directorate 


This  report  is  published  in  the  interest  of  scientific  and  technical  information  exchange  and  its 
publication  does  not  constitute  the  Government’s  approval  or  disapproval  of  its  ideas  or  findings. 


*Disseminated  copies  will  show  “//signature//”  stamped  or  typed  above  the  signature  blocks. 


REPORT  DOCUMENTATION  PAGE 


Form  Approved 
OMB  No.  0704-0188 


The public reporting burden for this collection of information is estimated to average 1 hour per response, including the time for reviewing instructions, searching existing data
sources, gathering and maintaining the data needed, and completing and reviewing the collection of information. Send comments regarding this burden estimate or any other aspect of this collection of
information,  including  suggestions  for  reducing  this  burden,  to  Department  of  Defense,  Washington  Headquarters  Services,  Directorate  for  Information  Operations  and  Reports  (0704-0188),  1215  Jefferson 
Davis  Highway,  Suite  1204,  Arlington,  VA  22202-4302.  Respondents  should  be  aware  that  notwithstanding  any  other  provision  of  law,  no  person  shall  be  subject  to  any  penalty  for  failing  to  comply  with  a 
collection  of  information  if  it  does  not  display  a  currently  valid  OMB  control  number.  PLEASE  DO  NOT  RETURN  YOUR  FORM  TO  THE  ABOVE  ADDRESS. 


Standard  Form  298  (Rev.  8-98) 

Prescribed  by  ANSI  Std.  Z39-18 




"The  views  and  conclusions  contained  herein  are  those  of  the  authors  and  should  not  be  interpreted  as  necessarily  representing  the  official  policies  or 
endorsements,  either  expressed  or  implied,  of  AFRL/SNAT  (now  AFRL/RYAT)  or  the  U.S.  Government." 


"This  material  is  based  on  research  sponsored  by  AFRL/SNAT  under  agreement  number  FA8650-05-1-1800  (BAA  04-03-SNK  Amendment  3).  The  U.S. 
Government  is  authorized  to  reproduce  and  distribute  reprints  for  Governmental  purposes  notwithstanding  any  copyright  notation  thereon." 

Geometric  diffusions  for  the  analysis  of  data  from 
sensor  networks 

Harmonic analysis on manifolds and graphs has recently led to mathematical developments in the field of data
analysis.  The  resulting  new  tools  can  be  used  to  compress  and  analyze  large  and  complex  data  sets,  such  as 
those  derived  from  sensor  networks  or  neuronal  activity  datasets,  obtained  in  the  laboratory  or  through  computer 
modeling.  The  nature  of  the  algorithms  (based  on  diffusion  maps  and  connectivity  strengths  on  graphs) 
possesses  a  certain  analogy  with  neural  information  processing,  and  has  the  potential  to  provide  inspiration  for 
modeling  and  understanding  biological  organization  in  perception  and  memory  formation. 

Introduction

Data  processing  and  analysis  has  always  been  a  vital  component  of  scientific  research;  increasingly  so  in  our 
times [1-4**,5], when highly resolved sensing in space and time gives rise to huge, high-dimensional
datasets.  The  same  holds  when  the  data  are  the  result  of  fine-grained  computational  modeling,  rather  than 
sensor  output.  In  neuroscience,  there  are  myriad  sources  of  very  high  dimensional  data.  Perhaps  the 
simplest  example  is  a  single  spike  train,  or  a  sequence  of  100  to  10,000  such  trains  [6].  The  situation 
becomes  much  more  interesting  (and  much  more  complicated)  when  one  considers  evaluating  the 
information  in  electrode  arrays  in,  for  example,  the  retina  [7],  the  hippocampus  [8,9]  or  the  motor  cortex 
[10].  Apart  from  these  foundational  questions,  ‘untangling  the  distributed  code’  (e.g.  [11,12])  is  now  a  key 
question  for  developing  man-machine  interfaces  [10,11],  and  is  not  unlike  related  questions  for  the  analysis 
of  EEG  and  MEG  signals.  The  techniques  described  here  should  be  relevant  to  many  of  these  tasks,  both 
for  developing  processing  algorithms  and  for  determining  the  level  of  structure  and  intrinsic  information  in 
the  signals.  The  additional  feature  of  extracting  higher  order  concepts  from  data  computationally  resonates 
with  the  way  such  concepts  are  extracted  from  data  physiologically.  We  comment  on  some  such  tentative 
‘cognitive  processing’  features  of  our  data  processing  algorithms. 

The mathematical theory underpinning these new data analysis algorithms is that of harmonic analysis on
sets of data represented as points lying in n-dimensional Euclidean space, $\mathbb{R}^n$, and on graphs constructed
using these data. These graphs, which connect data points in a manner described below, are
reminiscent of the interconnectivity graphs of sensor nodes (or neurons) in which the strength of the
connections represents a high affinity between nodes. The main challenge involving the analysis of such
complex  structures  lies  in  the  ability  to  explain  the  transition  from  local  ‘affinities’  of  massive  sensor 
outputs,  or  data,  to  some  higher  order  concepts,  regions  of  influence  and  connectivities  on  a  macroscopic 
scale.  The  mathematical  theory  described  here  leads  to  various  computational  methodologies  useful  in  data 
analysis  and  machine  learning  and,  as  such,  provides  a  powerful  tool  for  empirical  modeling. 

One goal of this review is to present these developments in data analysis; a second goal is to provide some
insight  into  mathematical  processing  mechanisms.  These  might  be  useful  to  the  scientist  studying 
empirical  data  processing  and  biological  information  processing  in  the  formulation  of  potential  models  of 
neuronal  organization  (or  sensor  fusion)  at  different  levels  of  granularity.  Our  approach  gives  rise  to  Markov 
processes  on  graphs  constructed  using  the  data;  and  uses  spectral  theory  and  eigenfunctions  of  these 
Markov processes [1**,2**], leading to a natural geometric organization of complex data sets, providing a
‘nonlinear’  principal  component  analysis.  We  remark  in  passing  that  the  top  eigenfunction,  corresponding 
to  the  highest  eigenvalue,  for  the  Web  graph  provides  the  ‘importance  ranking’  used  by  ‘Google’®  for 
webpage  ranking,  whereas  the  subsequent  eigenfunctions  provide  a  more  detailed  mapping.  More 
importantly,  we  show  how  these  eigenfunctions,  viewed  as  a  mathematical  and  computational  tool,  can  be 
replaced  by  ‘aggregates  of  nodes’,  equipped  with  a  notion  of  multiscale  affinity  which  can,  in  principle,  be 
implemented  biologically  through  various  linking  systems.  This  provides  a  potential  theoretical  mechanism 
for  simple  emergent  organization  and  learning  that  might  have  biological  relevance.  Although  related  ideas 
appear  in  a  variety  of  contexts  of  data  analyses,  such  as  spectral  graph  theory  [13],  manifold  learning  and 
nonlinear  dimensionality  reduction  [14-17],  we  augment  them  by  showing  that  the  diffusion  distances  are 
key  intrinsic  geometric  quantities  linking  spectral  theory  of  Markov  processes  to  the  corresponding 
geometry  of  the  data,  relating  localization  in  spectrum  to  localization  in  data  space  [2].  Existing 
dimensionality  reduction  techniques  typically  focus  either  on  global  or  on  local  features  of  the  data;  our 
methodology  integrates  features  at  all  scales  in  a  coherent  multiscale  structure. 

Geometric  diffusions  for  global  structure  definition  of  data 

In  applied  mathematics  we  often  view  ensembles  of  data  as  graphs  with  a  large  number  of  vertices,  with 
each  vertex  being  a  data  point  (e.g.  a  visual  stimulus),  and  edges  connecting  very  similar  data  points  (in  an 
application-specific  sense).  For  example,  two  visual  stimuli  could  be  considered  similar  if  they  excite  a 
visual  receptor  in  a  very  similar  way. 

Discovering  large-scale  structures  and  extracting  information  from  such  graphs  is,  in  general,  a  very 
challenging  task.  Often  the  data  are  high-dimensional,  that  is,  represented  by  long  strings  of  numbers 
(vectors);  however,  physical  or  other  constraints  force  the  set  of  points  or  their  probability  densities  to  be 
intrinsically  lower-dimensional,  so  they  can,  in  principle,  be  described  by  a  small  number  of  degrees  of 
freedom [1**,2**,14-17,18**,19**]. Our goal is to organize and process the data so as to reveal the low-dimensional
structure. We use diffusion semigroups to generate various multiscale inference (or affinity)
geometries (ontologies).

We  show  that  appropriately  selected  eigenfunctions  of  Markov  matrices  describing  local  transitions,  or 
affinities  in  the  system,  lead  to  coarse-grained,  macroscopic  structures  at  different  scales. 

In  particular,  the  leading  eigenfunctions  enable  a  low  dimensional  geometric  embedding  of  the  dataset  into 
a  lower-dimensional  Euclidean  space,  so  that  the  ordinary  Euclidean  distance  in  the  embedding  space 
measures  intrinsic  diffusion  (inference,  affinity  or  relevance)  metrics  of  the  data. 

The Euclidean correlation in $\mathbb{R}^n$, for large $n$, is, in general, not a good measure of affinity, except possibly
for very close-by data points. This is the reason for the introduction of the ‘closeness’ parameter $\varepsilon$ in the
formula below. The premise is that the Euclidean distance provides a meaningful measure of ‘affinity’ for
data lying closer than a cutoff distance quantified by this $\varepsilon$, and is meaningless for data beyond this cutoff.
One  of  the  main  contributions  is  to  find  an  embedding  space  such  that  the  Euclidean  distance  in  this  space 
is  truly  representative  of  the  closeness  (‘affinity’)  among  the  data. 

Mathematical  background 

Think of a point $X_i$ in Euclidean space as representing a string of outputs from a neuron labeled by $i$
(data vector, sensor output stream, and so on). A matrix of local affinities can be constructed as:

$$A = [X_i \cdot X_j]_\varepsilon := \exp\left(-\frac{1 - X_i \cdot X_j}{\varepsilon}\right), \qquad \|X_i\| = 1.$$

The  strength  of  such  a  data-correlation  based  affinity  decays  rapidly  with  the  distance  of  outputs  (other 
data affinities are possible, including chemical). We renormalize this matrix to a Markov matrix $A$ (or more
precisely $A_\varepsilon$), with sums of the entries of each row equaling one. $A$ measures local similarities, and
corresponds to one step of a random walk on the data [1**,2**,20]; its powers $A^t$ correspond to propagation
of  the  local  similarities  by  the  Markov  process  after  t  steps  (time)  of  the  random  walk.  This  random  walk 
on  the  data  gives  rise  to  a  geometric  diffusion  (analogous  to  the  derivation  of  the  diffusion  equation  from 
Brownian  motion).  For  large  t,  all  similarities  are  integrated  along  all  paths,  yielding  information  about 
global structures in the data. Remarkably, these can be efficiently computed: let $\varphi_l(i) = \varphi_l(X_i)$ be the $l$-th
eigenvector of $A$ evaluated at data point $i$, satisfying $A\varphi_l = \lambda_l^2 \varphi_l$ (the $\lambda_l^2$ are arranged in decreasing
order). Then

$$A^t(X_i, X_j) = \sum_l \lambda_l^{2t} \varphi_l(X_i) \varphi_l(X_j) = a_t(i,j) = a_t(X_i, X_j).$$

We consider the map

$$X_i \to \left(\lambda_1^t \varphi_1(X_i), \lambda_2^t \varphi_2(X_i), \ldots, \lambda_m^t \varphi_m(X_i)\right) = X_i^{(t)},$$

called the ‘diffusion map’, embedding into $\mathbb{R}^m$ at time $t$. The square of the ‘diffusion distance’ at time $t$,
measuring ‘divergence’ between nodes $i$ and $j$, is:

$$d_t^2(i,j) = a_t(i,i) + a_t(j,j) - 2a_t(i,j) = \sum_l \lambda_l^{2t} \left(\varphi_l(i) - \varphi_l(j)\right)^2 = \left\|X_i^{(t)} - X_j^{(t)}\right\|^2.$$

For large $t$ this can be computed very accurately using only the corresponding first few eigenfunctions,
because only a few of the terms $\lambda_l^{2t}$ are above the level of precision of interest (Figure 1). This provides a
diffusion map embedding of output data into a new low-dimensional Euclidean space, converting diffusion
distance on the data points into Euclidean distance in the embedding space.
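
The construction in this section can be sketched numerically. The Python/NumPy code below is an illustration under stated assumptions, not the authors' implementation: the function name `diffusion_map` and its parameters are hypothetical, and the sketch uses the symmetric normalization $D^{-1/2} K D^{-1/2}$ of the affinity matrix (it has the same eigenvalues as the row-stochastic matrix $D^{-1} K$) so that the eigenvectors are orthonormal and the spectral expansion of $A^t$ holds exactly.

```python
import numpy as np

def diffusion_map(X, eps, t, m=None):
    """Diffusion coordinates at time t (hypothetical helper, symmetric normalization assumed)."""
    X = X / np.linalg.norm(X, axis=1, keepdims=True)   # enforce ||X_i|| = 1
    K = np.exp(-(1.0 - X @ X.T) / eps)                 # local affinities exp(-(1 - X_i.X_j)/eps)
    d = K.sum(axis=1)                                  # row sums of the affinity matrix
    A = K / np.sqrt(np.outer(d, d))                    # symmetric conjugate of the Markov matrix
    mu, phi = np.linalg.eigh(A)                        # eigenvalues in ascending order
    order = np.argsort(mu)[::-1]                       # rearrange in decreasing order
    mu = np.clip(mu[order], 0.0, None)                 # mu_l plays the role of lambda_l^2
    phi = phi[:, order]
    if m is not None:                                  # keep only the leading m eigenfunctions
        mu, phi = mu[:m], phi[:, :m]
    coords = phi * mu ** (t / 2.0)                     # lambda_l^t * phi_l(X_i)
    return coords, mu, A
```

When all eigenvectors are kept, the squared Euclidean distance between two rows of `coords` equals $a_t(i,i) + a_t(j,j) - 2a_t(i,j)$ computed from the matrix power; passing a small `m` drops only the terms whose $\lambda_l^{2t}$ is negligible.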

As  a  first  simple  example  of  data  reorganization  provided  by  the  diffusion  embedding,  we  consider  a 
sampled  geometric  hourglass  surface,  idealizing  a  set  of  data  points  with  two  weakly  connected  clusters,  see 
Figure  2.  We  embed  the  point  cloud  into  three-dimensional  Euclidean  space  so  that  the  diffusion  distance 
in  the  original  space  can  be  computed  as  the  ordinary  Euclidean  length  of  the  chord  connecting  them  in 
embedding  space.  Because  the  diffusion  is  slower  through  the  bottleneck,  the  two  components  are  farther 
apart  in  the  diffusion  metric. 

In  Figure  3,  we  illustrate  the  organizational  ability  of  the  diffusion  maps  on  a  collection  of  images  given  in 
random  order.  The  inputs  are  2-D  gray  scale  pictures  of  the  object  in  ‘3D’  in  various  positions,  each  viewed 
as  a  32x32  =  1024  dimensional  vector.  To  calculate  the  embedding,  one  constructs  the  Markov  matrix 
as  above,  and  computes  the  first  few  eigenfunctions.  The  top  two  eigenfunctions  reveal  the  orientation  of 
‘3D’,  and  organize  the  data  accordingly,  see  Figure  3. 

Next,  we  organize  a  heterogeneous  material,  consisting  of  two  component  materials  (nodes,  represented  by 
circles  and  crosses),  possessing  different  conductivities  (Figure  4).  Although  the  gross  statistics  of  circles 
and  crosses  are  identical  on  both  lobes,  the  left  lobe  happens  to  have  more  highly  conductive  links,  which 
reduces  the  diffusion  distance  between  its  constituent  nodes.  The  left-to-right  bottleneck  increases  the 
diffusion distance between the two lobes, because there are fewer paths connecting the left and right lobes.
The  actual  long-time  affinity  structure  is  described  in  terms  of  the  eigenfunctions  (Figure  4):  on  the  left  all 
points  are  tightly  linked,  whereas  on  the  right  they  maintain  some  distance.  The  map  has  accounted  for  the 
preponderance  of  connections  through  all  paths  of  all  lengths  between  the  nodes. 

The  next  example  (Figure  5)  represents  an  organization  of  the  configuration  space  of  lip  images  that  arise 
from a single speaker. No structure is assumed. The local similarity between images, viewed as high-dimensional
vectors, organizes them as above in the first three diffusion coordinates. Different locations in
the  diffusion  plot  correspond  to  different  clusters  of  strongly  related  lip  images. 

Dynamic  learning  through  diffusion  geometry 

We  now  use  these  ideas  to  describe  various  learning  methodologies  in  which  the  diffusion  mechanism  is 
iteratively  adjusted  to  improve  accuracy. 


First,  we  generalize  the  basic  affinity  matrix  to  enable  purely  empirical  and  dynamical  modeling  and 
learning. 

Assume that a data point set (sensor output, individual neuron output strings, and so on) has been
generated by a process, the local statistical characteristics of which vary from location to location. For each
point $x$, we view the neighboring data points as generated by a local unknown diffusion process, the
probability density of which is estimated by $p_x(y) = c_x \exp(-q_x(x - y))$, where $q_x$ is a quadratic form
obtained empirically (for example by local principal component analysis [21]) from the data in a small
‘neighborhood’ of $x$.

We use the matrix $a(x,z) = \sum_y p_x(y) p_z(y)$ to model the corresponding data-driven diffusion. The
distance defined by this kernel is $d(x,z) = \left(\sum_y |p_x(y) - p_z(y)|^2\right)^{1/2}$, which can be viewed as the
natural distance on the ‘statistical tangent space’ to the point cloud.
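
A minimal sketch of this data-driven kernel follows. It is an illustration only: the helper names, the neighborhood size `k`, the regularization `reg`, the choice of $q_x$ as half the inverse of the local covariance, and the normalization of $p_x$ over the sample are all assumptions made for the example, not details given in the text.

```python
import numpy as np

def local_densities(X, k=10, reg=1e-3):
    """P[x, y] approximates p_x(y) = c_x * exp(-q_x(x - y)) (hypothetical helper)."""
    n, dim = X.shape
    P = np.zeros((n, n))
    for x in range(n):
        diff = X - X[x]                                    # y - x for every sample y
        dist2 = np.einsum("ij,ij->i", diff, diff)
        nbrs = np.argsort(dist2)[:k + 1]                   # small Euclidean neighborhood of x
        C = np.cov(X[nbrs], rowvar=False) + reg * np.eye(dim)  # local PCA covariance (regularized)
        Q = 0.5 * np.linalg.inv(C)                         # assumed quadratic form q_x
        q = np.einsum("ij,jk,ik->i", diff, Q, diff)        # q_x(x - y)
        p = np.exp(-q)
        P[x] = p / p.sum()                                 # c_x chosen so that sum_y p_x(y) = 1
    return P

def kernel_and_distance(P):
    """Return a(x,z) = sum_y p_x(y) p_z(y) and d(x,z) = (sum_y |p_x(y) - p_z(y)|^2)^(1/2)."""
    a = P @ P.T
    diff = P[:, None, :] - P[None, :, :]
    d = np.sqrt((diff ** 2).sum(axis=2))
    return a, d
```

Expanding the square shows that $d(x,z)^2 = a(x,x) + a(z,z) - 2a(x,z)$, so the distance is determined by the kernel itself.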


In a dynamical learning situation we can start with a data point $x$, use its Euclidean neighborhood to
define $p_x(y)$ at $x$, then find the $z$'s that can be reached from $x$ to compute locally $a(x,z)$. We then
propagate a density in a neighborhood of $x$ via powers of $A$, stopping when the propagation by diffusion
slows down.

When labels are available, separating (a subset of) the data in different classes, the information they provide
can be incorporated in $p_x$, by locally warping the metric so that the diffusion starting in one class stays in
that class without leaking to others. This could be obtained, for example, by using any kind of local
discriminant  analysis  [21]  to  build  a  local  metric,  the  ‘fast’  directions  of  which  are  parallel  to  the  boundary 
between  classes  and  the  ‘slow’  directions  of  which  are  transverse  to  the  class  boundaries.  We  also  suggest 
that  an  iterative,  partially  supervised  procedure  can  lead  to  good  results  in  many  practical  situations. 

In  Figure  6  we  represent  a  diffusion  from  labeled  samples,  from  three  different  types  of  tissue,  seeking  to 
identify  all  related  samples  in  the  image.  Here,  each  pixel  has  an  absorption  spectrum,  with  128  spectral 
dimensions.  The  middle  image  shows  the  failure  of  conventional  ‘nearest  neighbor’  classification,  whereas 
the  diffusion  distance  yields  a  better  classification. 
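
The text does not spell out the diffusion classifier, so the following is only one plausible sketch of diffusing from labeled samples: class indicator functions are propagated by $t$ steps of the Markov matrix and each point takes the class with the largest diffused mass. The function names, the Gaussian kernel in Euclidean distance, and the parameters `eps` and `t` are assumptions made for the illustration.

```python
import numpy as np

def markov_matrix(X, eps):
    """Row-stochastic matrix built from a Gaussian kernel (assumed affinity)."""
    sq = np.sum(X ** 2, axis=1)
    D2 = np.maximum(sq[:, None] + sq[None, :] - 2.0 * X @ X.T, 0.0)  # squared distances
    K = np.exp(-D2 / eps)
    return K / K.sum(axis=1, keepdims=True)          # each row sums to one

def diffusion_classify(X, labeled_idx, labels, eps, t):
    """Assign every point the class whose labeled samples a t-step walk most likely reaches."""
    labeled_idx = np.asarray(labeled_idx)
    labels = np.asarray(labels)
    P = markov_matrix(X, eps)
    classes = np.unique(labels)
    F = np.zeros((X.shape[0], classes.size))
    for c, cls in enumerate(classes):
        F[labeled_idx[labels == cls], c] = 1.0       # indicator of the labeled samples of class cls
    for _ in range(t):
        F = P @ F                                    # one step of diffusion
    return classes[np.argmax(F, axis=1)]
```

After $t$ steps, `F[i, c]` is the probability that a random walk started at point `i` sits on a labeled sample of class `c`, which is why paths of all lengths up to $t$ contribute rather than only the single nearest neighbor.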

Multiscale  analysis  of  diffusion  and  spectral  analysis 

Our  goal  is  to  replace  the  analytic  construction  of  the  eigenfunctions  by  direct  combinatorial  link 
organizations.  We  show  that  the  emergent  organization  discovered  above  with  the  help  of  the 
eigenfunctions  can  be  translated  into  a  multiscale  hierarchical  geometry  of  data  points.  This  point  of  view 
can  be  used  as  a  guide  for  theoretical  processing  models  in  biological  systems. 

The first few eigenfunctions of the matrix $A$ (or equivalently, of the Laplacian on a graph [13]) detect and
organize global structures on the data-based graph [1,16]. It is often the case, in biological and other
complex systems, that several organizational structures exist at different ‘scales’. Sensor outputs can be
grouped (compressed) into ensembles at different scales of complexity, to perform tasks at different levels
of complexity or abstraction, integrating the tasks performed at lower levels of complexity.

We  sketch  a  technique  for  constructing  these  sets  of  structures  at  different  scales  on  a  set  of  outputs  or 
data,  starting  from  the  finest  granularity,  and  building  up  to  more  complex  structures,  all  inter-related  at 
each  scale  and  across  scales,  culminating  in  the  global  structures  detected  and  described  by  the  analysis 
with  eigenfunctions  described  above.  In  the  case  of  clouds  of  data  points,  this  translates  into  a  multiscale 
analysis  of  the  cloud  of  points;  at  each  scale  we  have  a  set  of  aggregates  of  points,  and  relationships  among 
these  groups  are  determined  by  a  power  of  the  diffusion  operator  at  that  scale.  We  claim  (see  [2])  that  the 
embedding  provided  by  the  eigenfunctions  can  also  be  achieved  by  a  hierarchical  regrouping  of  data,  using 
affinity  at  different  diffusion  time  scales  as  a  grouping  mechanism. 

The  construction  alluded  to  above  is  most  easily  explained  in  terms  of  conventional  semantic  analysis  of 
text  documents,  each  document  being  a  data  point.  Each  document  has  coordinates  that  represent  the 
frequency  of  occurrence  of  words  in  it.  We  correlate  only  documents  with  strong  similarity  of  vocabulary. 
Given a document $x$, we can build a folder around it of documents with strong immediate affinity (i.e.
nearest neighbors). This becomes a folder at ‘scale 1’. To obtain a folder at ‘scale 2’ we consider all
documents, $y$, that are nearest neighbors to a nearest neighbor of $x$ (i.e. they are linked by a chain of
length 2 to $x$), and measure affinity as the sum of strength of all these chains of length 2 linking $y$ to $x$;
we keep only those $y$ with strong affinity to form a folder at scale 2. We repeat this process for all chains
of length 4 and less. One can easily build a directory structure of folders at all dyadic scales, with folders at
a  fixed  scale  being  disjoint.  From  our  point  of  view,  every  sensor  (every  neuron)  can  be  viewed  as  a 
document  for  which  a  string  of  sensor  outputs  are  the  coordinates  (elementary  semantic  content),  whereas 
the  folders  are  groups  of  outputs  combining  similar  or  highly  related  outputs  at  different  resolution  (or 
abstraction)  levels.  In  Figure  7  the  elementary  documents  are  various  6x6  patches  of  the  image  in  the  first 
panel.  The  folders  at  different  levels  of  resolution  correspond  to  higher  level  features  of  the  image. 
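
The folder construction for documents can be illustrated on a toy corpus. Everything specific in this sketch is hypothetical: the four word-frequency vectors, the similarity threshold of 0.4, and the helper `folder`. It uses raw thresholded correlations rather than a normalized Markov matrix, and it relies on the fact that the sum of strengths of all chains of length 2 between two documents is an entry of the squared affinity matrix.

```python
import numpy as np

# Hypothetical corpus: each row is the word-frequency vector of one document.
docs = np.array([[3, 1, 0, 0],
                 [2, 2, 0, 0],
                 [0, 2, 2, 0],
                 [0, 0, 2, 1]], dtype=float)
docs /= np.linalg.norm(docs, axis=1, keepdims=True)

W = docs @ docs.T          # correlation of vocabularies
W[W < 0.4] = 0.0           # correlate only documents with strong similarity

def folder(affinity, x, threshold):
    """Indices of the documents whose affinity to document x is at least threshold."""
    return [int(y) for y in np.flatnonzero(affinity[x] >= threshold)]

W2 = W @ W                 # W2[x, y] sums the strength of all chains x -> k -> y

scale1 = folder(W, 0, 0.4)     # immediate neighbors of document 0 (and itself)
scale2 = folder(W2, 0, 0.4)    # documents reached by strong chains of length 2 or less
```

Because each document keeps its self-affinity on the diagonal, chains of length 2 include chains that pause for one step, so the scale-2 folder contains the scale-1 folder. Squaring `W2` again would give the chains of length 4 and less mentioned above.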

To relate this description to a mathematical formulation we start by observing, as above (Figure 1), that the
numerical rank of $(A_\varepsilon)^{t/\varepsilon}$ decreases rapidly as $t$ increases. In particular, if we consider the
expansions $a_t(x,y) = \sum_l \lambda_l^{2t/\varepsilon} \varphi_l(x) \varphi_l(y)$, for $t = \varepsilon 2^j$, obtained by successive squaring, then for any
fixed precision the summation can be restricted to smaller and smaller sets of indices.
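
The decay of the numerical rank under successive squaring is easy to check numerically. The sketch below is illustrative: the function names and the default precision are assumptions, and the numerical rank is taken to be the number of singular values above the chosen precision.

```python
import numpy as np

def numerical_rank(M, precision):
    """Number of singular values of M that exceed the given precision."""
    s = np.linalg.svd(M, compute_uv=False)
    return int(np.sum(s > precision))

def dyadic_ranks(A, levels, precision=1e-6):
    """Numerical ranks of A, A^2, A^4, ... obtained by successive squaring."""
    ranks = []
    M = A.copy()
    for _ in range(levels):
        ranks.append(numerical_rank(M, precision))
        M = M @ M              # A^(2^j) -> A^(2^(j+1))
    return ranks
```

For a symmetric matrix whose eigenvalues lie between 0 and 1, each squaring pushes every eigenvalue below 1 closer to zero, so fewer terms of the expansion remain above a fixed precision at each dyadic scale.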

Secondly, the columns $a_t(x,y)$ of the matrix $(A_\varepsilon)^{t/\varepsilon}$ represent the probability of transition in $t/\varepsilon$ steps
from $x$ to $y$.

We can also interpret the $x$ column of the matrix $A^{2^j}$, $a_j(x,y)$, as a rank of affinity between sensor
(neuron) output $x$ and sensor (neuron) output $y$ at scale $j$, and the collection of points $y$ such that
$a_j(x,y) > \delta$ could represent all sensor (neuron) outputs $y$ similar to $x$.


We present a very simple method for obtaining a hierarchical ‘sensor folder’ (or ‘neuron group’)
organization, as described above for the text documents. A minimal collection of clusters organizing the
whole set of points at different levels of granularity is obtained as follows: let $\{x_k^{j+1}\}$ be a maximal
subcollection of points in $\{x_k^j\}$ (key-points at scale $j$), such that $1/2 \le d_{2^j}(x_k^{j+1}, x_{k'}^{j+1})$ for $k \ne k'$,
where $\{x_k^0\}$ are the original points. Then any point is at distance at most 1/2 at scale $j$ from one of the
selected ‘key-points’ at that scale, enabling us to create a document folder labeled by the key-point. It is
easy to modify this construction to obtain a tree of non-overlapping folders.
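
One simple way to realize such a maximal separated subcollection is a greedy pass over the candidates. The sketch below is an assumption-laden illustration, not the construction of [2]: the function names are hypothetical, the matrix `A` is assumed symmetric so that the diffusion distance formula applies, and the separation `sep` is left as a parameter because the appropriate value depends on how the distance is normalized.

```python
import numpy as np

def diffusion_distances(A, t):
    """Pairwise diffusion distances d_t from a symmetric diffusion matrix A."""
    At = np.linalg.matrix_power(A, t)
    diag = np.diag(At)
    d2 = diag[:, None] + diag[None, :] - 2.0 * At    # a_t(i,i) + a_t(j,j) - 2 a_t(i,j)
    return np.sqrt(np.maximum(d2, 0.0))

def select_keypoints(D, candidates, sep=0.5):
    """Greedy maximal subset of candidates with pairwise distance at least sep."""
    keys = []
    for c in candidates:
        if all(D[c, k] >= sep for k in keys):
            keys.append(c)
    return keys

def build_hierarchy(A, levels, sep=0.5):
    """Key-points at scales 0, 1, ..., levels; scale j + 1 is chosen among the scale-j key-points."""
    keys = list(range(A.shape[0]))                   # scale 0: the original points
    hierarchy = [keys]
    for j in range(levels):
        D = diffusion_distances(A, 2 ** j)           # distance at dyadic time 2^j
        keys = select_keypoints(D, keys, sep)
        hierarchy.append(keys)
    return hierarchy
```

The greedy subset is maximal: every rejected candidate lies closer than `sep` to some selected key-point, so assigning each point to its nearest key-point yields non-overlapping folders at that scale.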

This  construction,  when  applied  to  text  documents  (equipped  with  semantic  coordinates),  builds  an 
automatic  folder  structure  with  corresponding  key  documents  characterizing  the  folders. 

A detailed, refined construction of scaling functions (columns of $A^t$) and wavelets representing this
multiscale organization of the graph is provided in Coifman and Maggioni [2], and connections with related
algorithms in numerical analysis in Brandt [22]. This analysis of aggregation at different times (and
corresponding  scales),  enables  us  to  perform  multiscale  wavelet  analysis  on  manifolds  and  graphs  in  a 
natural  way.  Applications  include  compression  of  functions  on  the  dataset,  denoising  of  such  functions,  and 
learning  (in  the  sense  of  classification  and  regression)  of  functions  on  the  dataset.  Although  the  description 
of  the  analysis  given  above  refers  only  to  organization  of  existing  data,  we  point  out  that  the  tools 
developed  also  enable  the  incorporation  of  new  data  points  into  the  structure  in  a  consistent  way,  and  the 
extension  of  functions  modeled  on  the  data  to  new  sensor  outputs  [1**,2**,4**]. 

The  multiscale  construction  enables  structure  to  emerge  at  different  scales  as  a  function  of  connectivity.  In 
Figure  7  we  show  several  small  patches  from  a  simple  image.  If  all  patches  are  considered,  edge  filters  (at 
the  finer  scales)  and  blob  filters  (at  the  coarser  scales)  naturally  arise.  Note  the  clear  curvature  in  their 
structure [23]. Restricting the number of patches would result in more V1-like ‘receptive-fields’ [24-27].

Stochasticity  and  coherence 

Global  geometric  diffusions  can  be  applied  to  data  driven  by  a  Langevin  equation  [19**]  that  is  used  to 
model  many  biological  systems  [28-30],  for  example,  stochastic  unsynchronized  neuronal  pulse  trains.  The 
macroscopic  probability  density  behavior  of  such  systems  is  governed  by  the  Fokker-Planck  operator 
[19**],  the  eigenfunctions  of  which  can  be  empirically  approximated  as  described  above,  leading  to  efficient 
descriptions  of  likely,  long-time  probability  configurations  and  geometries  [2**,9,19].  The  connections 
between  Bayesian  learning  and  Fokker-Planck  equations  date  back  to  Verrelst  [31]  and  references  therein. 

Diffusion  wavelets  and  global  diffusion  have  both  been  applied  successfully  to  learning  processes  in  a 
variety  of  (stochastic)  environments,  where  an  agent  (e.g.  robot)  learns  optimal  behavior  for  achieving 
certain  tasks  from  past  experiences  [18**]. 

Conclusions 

Diffusion  geometries  can  reveal  structure  in  data  at  different  levels  of  organization.  Because  many  sources 
of  data  in  neuroscience  are  high-dimensional,  understanding  their  primary,  low-dimensional  intrinsic 
structure  can  be  insightful.  It  has  been  indicated  that  image  patch  structure  can  suggest  receptive  field 
properties,  and  that  different  properties  emerge  at  different  levels.  The  intrinsic  dimensionality  can  also  be 
useful  for  efficient  data  analysis.  Many  applications  of  these  techniques  in  neuroscience  remain  to  be  tried, 
from  spike  train  analysis  to  olfaction  and  the  electroencephalogram  (EEG).  But  perhaps  more  exciting  is 
the  possibility  that  emergent  structure  across  levels  will  open  a  theoretical  door  into  cognitive  neuroscience 
and  memory  organization. 

Matlab  scripts  for  the  computations  involved  in  diffusion  maps  and  multiscale  analysis  of  diffusion  are 
available  online  [32]  or  upon  request  from  M  Maggioni. 

Figure 1

The  spectra  of  powers  of  A  .  Some  examples  of  the  spectra  of  the  dyadic  powers  of  A  .  The  x  axis  is  the  index  of  the 
eigenvalue,  and  the  y  axis  the  eigenvalue  itself.  Eigenvalues  are  positive  and  are  arranged  in  nonincreasing  order. 

(a) Original dumbbell (b) Embedding


Figure  2 

Diffusion embedding of a sampled hourglass manifold. (a) An original set of points sampled on an hourglass manifold, as a
model for two weakly-connected clusters C1 and C2, and (b) their embedding using the eigenfunctions of the diffusion matrix
$A$. The Euclidean distance in the image in (b) is equivalent to the large-time $t$ diffusion distance on the original set of points in (a). The
two ‘clusters’ get flattened and move further apart in the new space. The axes just provide a reference frame.

Figure 3

Diffusion embedding of a set of pictures of “3D”. Organization emerging from a collection of images given in random order
(data = $\{x_i\}$). (a) The images are displayed according to their location in the two-dimensional diffusion embedding
$(\varphi_1(x_i), \varphi_2(x_i))$, displayed in (b). The coordinates capture (perceive) the orientation of the picture in 3D.


Figure 4

Diffusion embedding of a heterogeneous material. (a) A heterogeneous material and (b) its long-term diffusion embedding
$(\varphi_2(x_i), \varphi_3(x_i))$. This structure could be interpreted as a map of trees (circles) and shrubs (crosses), with the links
representing the probability of fire propagating among them. From (b) it is clear that the risk of fire propagating from top to
bottom is higher on the left side of the forest. Color is included so that points can be matched across the two pictures.


Figure 5

Diffusion  embedding  of  images  of  lips.  The  lip  alphabet  is  learnt  from  a  set  of  pictures  of  the  lips  of  a  speaker.  The  manifold 
structure  and  its  parameters  are  parametrized  by  the  three  top  eigenfunctions  (axes  in  the  figure)  of  the  diffusion,  and  this 
parametrization  can  be  used  to  lip-read.  An  interpretation  of  the  low  order  eigenfunctions  is  openness  of  the  mouth  and 
exposure  of  teeth. 

Figure 6

Classification of tissue types in a hyperspectral image through diffusion. (a) A slice of a hyperspectral image with three
selected regions that correspond to three different biologically significant types of tissue: nuclei (blue), cytoplasm of epidermal
cells (pink) and collagen in the underlying dermis (green). (b) Predictions of tissue type by a standard nearest neighbor
classifier, trained on the set in (a). (c) Predictions made by the diffusion classifier described above, with the training set
represented in (a).
Figure 7

Multiscale folders. (a) Original picture. (b) A subset of 6 × 6 pixel patches extracted from the image. (c) A folder at scale 2 is
a weighted aggregate of patches, representing a higher level feature. (d) Another folder at scale 2 is an edge detector. (e and
f) Two folders at scale 3 that represent weighted aggregates of patches (‘attributes’ or ‘features’) at an even coarser scale.
Abstract

Data  fusion  and  multi-cue  data  matching  are  fundamental  tasks  of  high-dimensional  data  analysis.  In  this 
paper,  we  apply  the  recently  introduced  diffusion  framework  to  address  these  tasks.  Our  contribution  is  three-fold. 
First,  we  present  the  Laplace-Beltrami  approach  for  computing  density  invariant  embeddings  which  are  essential 
for  integrating  different  sources  of  data.  Second,  we  describe  a  refinement  of  the  Nystrom  extension  algorithm 
called  “geometric  harmonics”.  We  also  explain  how  to  use  this  tool  for  data  assimilation.  Finally,  we  introduce  a 
multi-cue data matching scheme based on nonlinear spectral graph alignment. The effectiveness of the presented
schemes is validated by applying them to the problems of lip-reading and image sequence alignment.


Index  Terms 

Pattern  matching,  graph  theory,  graph  algorithms,  Markov  processes,  machine  learning,  data  mining,  image 
databases. 


I.  Introduction 

The  processing  of  massive  high-dimensional  data  sets  is  a  contemporary  challenge.  Suppose  that  a 
source s produces high-dimensional data $\{x_1, \ldots, x_n\}$ that we wish to analyze. For instance, the data
points could be the frames of a movie produced by a digital camera, or the pixels of a hyperspectral image.
When  dealing  with  this  type  of  data,  the  high-dimensionality  is  an  obstacle  for  any  efficient  processing  of 
the  data.  Indeed,  many  classical  data  processing  algorithms  have  a  computational  complexity  that  grows 
exponentially  with  the  dimension  (this  is  the  so-called  “curse  of  dimensionality”).  On  the  other  hand,  the 
source  s  may  only  enjoy  a  limited  number  of  degrees  of  freedom.  This  means  that  most  of  the  variables 
that describe each data point are highly correlated, at least locally, or equivalently, that the data set has a
low  intrinsic  dimensionality.  In  this  case,  the  high-dimensional  representation  of  the  data  is  an  unfortunate 
(but  often  unavoidable)  artifact  of  the  choice  of  sensors  or  the  acquisition  device.  Therefore  it  should  be 
possible  to  obtain  low-dimensional  representations  of  the  samples.  Note  that  since  the  correlation  between 
variables  might  only  be  local,  classical  global  dimension  reduction  methods  like  Principal  Component 
Analysis  and  Multidimensional  Scaling  do  not  provide,  in  general,  an  efficient  dimension  reduction. 

First  introduced  in  the  context  of  manifold  learning,  eigenmaps  techniques  [1],  [2],  [3],  [4]  are  becoming 
increasingly  popular  as  they  overcome  this  problem.  Indeed,  they  allow  one  to  perform  a  nonlinear 
reduction  of  the  dimension  by  providing  a  parametrization  of  the  data  set  that  preserves  neighborhoods. 
However,  the  new  representation  that  one  obtains  is  highly  sensitive  to  the  way  the  data  points  were 
originally  sampled.  More  precisely,  if  the  data  are  assumed  to  approximately  lie  on  a  manifold,  then  the 
eigenmap  representation  depends  on  the  density  of  the  points  on  this  manifold  [5].  This  issue  is  of  critical 
importance  in  applications  as  one  often  needs  to  merge  data  that  were  produced  by  the  same  source  but 
acquired  with  different  devices  or  sensors,  at  various  sampling  rates  and  possibly  on  different  occasions.  In 
that  case,  it  is  necessary  to  have  a  canonical  representation  of  the  data  that  retains  the  intrinsic  constraints 
of  the  samples  (e.g.  manifold  geometry)  regardless  of  the  particular  distribution  of  the  datasets  sampled 
by  different  devices. 

1Google  Inc.,  stephane.lafon@gmail.com 
Another important issue is that of data matching. This question arises when one needs to establish a
correspondence  between  two  data  sets  resulting  from  the  same  fundamental  source.  For  instance,  consider 
the  problem  of  matching  pixels  of  a  stereo  image  pair.  One  can  form  a  graph  for  each  image,  where  pixels 
constitute  the  nodes,  and  where  edges  are  weighted  according  to  the  local  features  in  the  image.  The 
problem now boils down to matching nodes between two graphs. Note that this situation is an instance of
the multi-sensor integration problem, in which one needs to find the correspondence between data captured by
different  sensors.  In  some  applications,  like  fraud  detection,  synchronizing  data  sets  is  used  for  detecting 
discrepancies  rather  than  similarities  between  data  sets. 

The  out-of-sample  extension  problem  is  another  aspect  of  the  data  fusion  problem.  The  idea  is  to  extend 
a  function  known  on  a  training  set  to  a  new  point  using  both  the  target  function  and  the  geometry  of 
the  training  domain.  The  new  point  and  the  corresponding  value  of  the  function  can  then  be  assimilated 
to  the  training  set.  This  is  an  essential  component  in  any  scheme  that  agglomerates  knowledge  over  an 
initial data set and then applies the inferred structure to new data. Recently, Belkin et al. have developed
a solution to this problem via the concept of manifold regularization [6]. Earlier, several authors used
the Nystrom extension procedure in the Machine Learning context [7], [8] in order to extend eigenmap
coordinates. In both cases, the question of the scale of the extension kernel remains unanswered. In other
words, given an empirical function on a data set, how far from the training set can this function
be extended? In particular, given the spectral embedding of the data set, which kernel should be used to
extend it?

By  relating  the  frequency  content  of  the  target  function  on  the  training  set  to  the  extrinsic  Fourier 
analysis, Coifman et al. provide an answer to this question [9]. They developed the idea of “geometric
harmonics”  based  on  the  Nystrom  extension  at  different  scales,  providing  a  multiscale  extension  scheme 
for  empirical  functions.  We  apply  this  concept  to  the  extension  of  spectral  embeddings  and  show  that  the 
extension  has  to  be  conducted  using  a  specially  designed  kernel  which  differs  from  the  eigenmap  kernel. 

In  this  article,  we  show  that  the  questions  discussed  above  can  be  efficiently  addressed  by  the  general 
diffusion framework introduced in [5], [10], [11]. The main idea is that, just like for eigenmaps methods,
eigenvectors  of  Markov  matrices  can  be  used  to  embed  any  graph  into  a  Euclidean  space  and  achieve 
dimension  reduction.  Building  on  these  ideas,  the  contribution  of  this  paper  is  three-fold: 

•  First,  we  show  that  by  carefully  normalizing  the  Markov  matrix,  the  embedding  can  be  made  invariant 
to  the  density  of  the  sampled  data  points,  thus  solving  the  problem  of  data  fusion  encountered  with 
other  eigenmaps  methods. 

•  Then,  we  address  the  problem  of  out-of-sample  extension,  and  we  explain  how  to  adaptively  extend 
empirical  functions  to  new  samples  using  the  geometric  harmonics.  In  particular  this  allows  us  to 
extend  the  diffusion  coordinates  to  new  data  points. 

•  Last,  we  take  advantage  of  the  density-invariant  representation  of  data  sets  provided  by  the  diffusion 
coordinates  to  derive  a  simple  data  matching  algorithm  based  on  geometrical  embeddings  alignment. 

The  proposed  scheme  is  experimentally  verified  by  applying  it  to  visual  data  analysis.  First,  we 
address the problem of automatic lip-reading by embedding the lip images using the Laplace-Beltrami
eigenfunctions  and  deriving  an  automatic  lip-reading  scheme  where  new  data  is  assimilated  using  geometric 
harmonics.  Second,  we  demonstrate  the  multi-cue  data  matching  aspect  of  our  work  by  matching  image 
sequences  corresponding  to  similar  head  motions. 

This  paper  is  organized  as  follows:  we  start  by  recalling  the  diffusion  framework,  and  the  notion  of 
diffusion  maps  in  Section  II-A.  We  then  explain  in  Section  II-B  how  to  normalize  the  diffusion  kernel  in 
order  to  separate  the  geometry  (constraints)  of  the  data  from  the  distribution  of  the  points.  We  describe  the 
out-of-sample extension procedure via the geometric harmonics in Section II-C and present a nonlinear
algorithm for matching two data sets in Section II-D. Last, we illustrate these ideas by applying them to
lip-reading  and  sequence  alignment  in  Section  III. 
II. The diffusion framework


We  start  by  reviewing  the  density-invariant  embedding  and  out-of-sample  extension  schemes  (previously 
introduced in [5] and [9]) in Sections II-B and II-C, respectively. To exemplify their applicability to
high-dimensional data processing and learning, we apply them to derive a novel high-dimensional data alignment
algorithm  in  Section  II-D. 


A.  Diffusion  maps  and  diffusion  distances 

Let $\Omega = \{x_1, \ldots, x_n\}$ be a set of n data points. In this section, we recall the diffusion framework as
described in [5], [12], [13]. The main point of this set of techniques is to introduce a useful metric on
data  sets  based  on  the  connectivity  of  points  within  the  graph  of  the  data,  and  also  to  provide  coordinates 
on  the  data  set  that  reorganize  the  points  according  to  this  metric. 

The first step in our construction is to view the data points $\Omega = \{x_1, \ldots, x_n\}$ as being the nodes of
a symmetric graph in which any two nodes $x_i$ and $x_j$ are connected by an edge. The strength of this
connection is measured by a non-negative weight $w(x_i, x_j)$ that reflects the similarity between $x_i$ and $x_j$.
The very notion of similarity between two data points is completely application-driven. In many situations,
however, each data point is a collection of continuous numerical measurements and, maybe after rescaling
some of the features, it can be thought of as a point in a Euclidean feature space. In this case, similarity
can be measured in terms of closeness in this space, and it is customary to weight the edge between $x_i$ and
$x_j$ by $\exp(-\|x_i - x_j\|^2/\varepsilon)$, where $\varepsilon > 0$ is a scale parameter. This choice corresponds to the belief that
the only relevant information lies in local distance measurements. Indeed, $x_i$ and $x_j$ will be numerically
connected if they are sufficiently close. In diffusion kernels, graphs represent the structures of the input
spaces, and the vertices are the objects to be classified. In addition, Belkin and Niyogi [2] explain that, in
the case of a data set approximately lying on a submanifold, this choice corresponds to an approximation
of the heat kernel on the submanifold. Last, in [5], it is shown that any weight of the form $h(\|x_i - x_j\|^2)$
(where h decays sufficiently fast at infinity) allows one to approximate the heat kernel.

More generally, we allow ourselves to consider arbitrary weight functions that satisfy the following
two conditions¹ for all x and y in $\Omega$:

• it is symmetric: $w(x, y) = w(y, x)$,

• it is pointwise non-negative: $w(x, y) \geq 0$.

This level of generality allows one to take into account the case when data points are represented by a
collection  of  categorical  features.  In  this  situation,  it  can  be  useful  to  employ  a  Gaussian  kernel  with 
a  Hamming  distance.  But  rather  than  to  give  a  list  of  recipes,  we  would  like  to  underline  the  fact  that 
the  choice  of  the  weight  function  should  be  entirely  application-driven.  The  weight  function  or  kernel 
describes  the  first-order  interaction  between  the  data  points  as  it  defines  the  nearest  neighbor  structures  in 
the  graph.  It  should  capture  a  notion  of  similarity  as  meaningful  as  possible  with  respect  to  the  application, 
and  therefore  could  very  well  take  into  account  any  type  of  prior  knowledge  on  the  data.  The  analysis  of 
the  data  provided  by  the  diffusion  techniques  depends  heavily  on  the  choice  of  the  weight  function.  Last, 
note  that  the  only  real  requirement  for  our  technique  to  be  applicable  is  to  be  able  to  define  a  local  notion 
of similarity between the points. In other words, one must be able to answer the question of whether two
points  are  (very)  similar  or  not.  This  is  a  much  simpler  question  than  having  to  define  a  global  distance 
between  all  pairs  of  points. 

Following a classical construction in spectral graph theory [15], namely the normalized graph Laplacian,
we now create a random walk on the data set $\Omega$ by forming the following kernel:

$$p_1(x, y) = \frac{w(x, y)}{d(x)},$$

where $d(x) = \sum_{z \in \Omega} w(x, z)$ is the degree of node x.
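The construction above can be sketched in a few lines of NumPy. This is only an illustration under stated assumptions: the Gaussian weight is one admissible choice among many, the function names `gaussian_weights` and `random_walk` are hypothetical, and the toy data and the value of the scale are arbitrary.

```python
import numpy as np

def gaussian_weights(X, eps):
    """Pairwise weights w(x_i, x_j) = exp(-||x_i - x_j||^2 / eps)."""
    sq_norms = np.sum(X ** 2, axis=1)
    sq_dists = sq_norms[:, None] + sq_norms[None, :] - 2.0 * X @ X.T
    sq_dists = np.maximum(sq_dists, 0.0)   # clip negative round-off
    return np.exp(-sq_dists / eps)

def random_walk(W):
    """Row-normalize the weights: p_1(x, y) = w(x, y) / d(x)."""
    d = W.sum(axis=1)                      # degree of each node
    return W / d[:, None], d

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))               # toy data set, illustrative only
W = gaussian_weights(X, eps=1.0)
P, d = random_walk(W)
```

Each row of `P` sums to one, so `P[i, j]` can be read as the probability of jumping from point i to point j in one step.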


¹ Since $w(\cdot, \cdot)$ is supposed to represent the similarity between data points, it will be fair to assume that $w(x, x) > 0$.
Since we have that $p_1(x, y) \geq 0$ and $\sum_{y \in \Omega} p_1(x, y) = 1$, the quantity $p_1(x, y)$ can be interpreted as the
probability for a random walker to jump from x to y in a single time step. If P is the n × n matrix of
transition of this Markov chain, then taking powers of this matrix amounts to running the chain forward
in time. Let $p_t(\cdot, \cdot)$ be the kernel corresponding to the t-th power of the matrix P. In other words, $p_t(\cdot, \cdot)$
describes the probabilities of transition in t time steps.

The  asymptotic  behavior  of  this  random  walk  has  been  used  to  find  clusters  in  the  data  set  [15],  [16], 
[17],  where  the  first  non-constant  eigenfunction  is  used  as  a  classification  function  into  two  clusters.  This 
was justified as a relaxation of a discrete problem of finding an optimal cut in a graph [16]. This approach
was later generalized to using more eigenvectors in order to compute a larger number of clusters (see for
instance [18], [19], [13]). Several papers from machine learning (in particular [14]) have underlined the
connections and applications of the graph Laplacian to machine learning. Within the manifold learning
community, the first few eigenvectors of this Markov chain have been employed for dimensionality
reduction. In [20], [2] Belkin and Niyogi showed that when data is uniformly sampled from a
low-dimensional manifold, the first few eigenvectors of P are discrete approximations of the eigenfunctions
of  the  Laplace-Beltrami  operator  on  the  manifold,  thus  providing  a  mathematical  justification  for  their 
use  in  this  case. 

If the graph is connected, then as $t \to +\infty$ this Markov chain is governed by a unique stationary
distribution $\phi_0$ (see appendix I), which means that for all x and y,

$$\lim_{t \to +\infty} p_t(x, y) = \phi_0(y).$$

The vector $\phi_0$ is the top left eigenvector of P, i.e., $\phi_0^T P = \phi_0^T$, and it can be verified that $\phi_0(y)$ is given
by

$$\phi_0(y) = \frac{d(y)}{\sum_{z \in \Omega} d(z)}.$$
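As a quick numerical check of these two statements, the sketch below builds a small random walk and compares the rows of a high power of P with the normalized degrees. The data, the kernel scale and the number of steps are arbitrary assumptions chosen so that the chain mixes quickly; nothing here is taken from the experiments of this paper.

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.uniform(size=(40, 2))                 # toy points in the unit square
sq = ((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=2)
W = np.exp(-sq / 1.0)                         # symmetric, strictly positive weights
d = W.sum(axis=1)
P = W / d[:, None]                            # one-step transition matrix

phi0 = d / d.sum()                            # candidate stationary distribution
P_t = np.linalg.matrix_power(P, 200)          # t-step kernel p_t for t = 200

left_eigen_residual = np.abs(phi0 @ P - phi0).max()
row_deviation = np.abs(P_t - phi0[None, :]).max()
```

Both `left_eigen_residual` and `row_deviation` should be at the level of round-off: every row of the 200-step kernel has become, numerically, the same vector of normalized degrees.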

The pre-asymptotic regime is governed by the following eigendecomposition [12]:

$$p_t(x, y) = \sum_{i \geq 0} \lambda_i^t \psi_i(x) \phi_i(y),$$ (1)

where $\{\lambda_i\}$ is the sequence of eigenvalues of P (with $|\lambda_0| \geq |\lambda_1| \geq \ldots$) and $\{\phi_i\}$ and $\{\psi_i\}$ are the
corresponding biorthogonal left and right eigenvectors (see appendix II for a proof). Furthermore, because
of the spectrum decay, only a few terms are needed to achieve a given relative accuracy $\delta > 0$ in the
previous sum.

Unifying ideas from Markov chains and potential theory, the diffusion distance between two points x
and z was introduced in [12], [5] as

$$D_t^2(x, z) = \sum_{y \in \Omega} \frac{(p_t(x, y) - p_t(z, y))^2}{\phi_0(y)}.$$ (2)

This quantity is simply a weighted $L^2$ distance between the conditional probabilities $p_t(x, \cdot)$ and $p_t(z, \cdot)$.
These probabilities can be thought of as features attached to the points x and z, and they measure the
influence or interaction of these two nodes with the rest of the graph.

By  increasing  t,  one  propagates  the  local  or  short-term  influence  of  each  node  to  its  nearest  neighbors, 
and  this  means  that  t  also  plays  the  role  of  a  scale  parameter.  The  comparison  of  these  conditional 
probabilities  introduces  a  notion  of  proximity  that  accounts  for  the  connectivity  of  the  points  in  the  graph. 
In  particular,  unlike  the  shortest  path,  or  geodesic  distance,  this  metric  is  robust  to  noise  as  it  involves  an 
integration along all paths of length t starting from x or z. Empirical evidence supporting this claim is
provided  in  [13].  The  diffusion  distance  incorporates  the  notions  of  mixing  time  and  clusterness  used  in 
classical  graph  theory  [21]. 
The connection between the diffusion distance and the eigenvectors goes as follows (see appendix II):

$$D_t^2(x, z) = \sum_{i \geq 1} \lambda_i^{2t} (\psi_i(x) - \psi_i(z))^2.$$ (3)

Note that $\psi_0$ does not appear in the sum because it is constant. This identity means that the right
eigenvectors  can  be  used  to  compute  the  diffusion  distance.  The  diffusion  distance  therefore  generalizes 
the  use  of  the  eigenvectors  for  finding  bottlenecks  and  clusters  in  the  graph  [21],  and  extends  this  approach 
by  taking  into  account  more  than  just  the  second  largest  eigenvalue. 

Furthermore, and as mentioned before, because of the spectrum decay, only a few terms are needed to
achieve a given relative accuracy $\delta > 0$ in the previous sum. Let m(t) be the number of terms retained,
and define the diffusion map

$$\Psi_t : x \mapsto (\lambda_1^t \psi_1(x), \lambda_2^t \psi_2(x), \ldots, \lambda_{m(t)}^t \psi_{m(t)}(x))^T.$$ (4)

This mapping provides coordinates on the data set $\Omega$, and embeds the n data points into the Euclidean
space $\mathbb{R}^{m(t)}$. In addition, the spectrum decay is the reason why dimension reduction can be achieved.
This method constitutes a universal and data-driven way to represent a graph or any generic data set as a
cloud of points in a Euclidean space. We also obtain a complete parametrization of the data that captures
relevant modes of variability. Moreover, the dimension m(t) of the new representation only depends on the
properties of the random walk on the data, and not on the number of features of the original representation
of the data. In particular, if we increase t, then m(t) decreases and we capture larger-scale structures in
the data.
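The relation between the diffusion distance and the diffusion map can be checked numerically. The sketch below is an illustration only: the helper name `diffusion_map` is hypothetical, the eigenvectors are obtained through the symmetric conjugate of P (a standard device, assumed here rather than taken from the text), and all terms of the sum are kept so that the two sides agree up to round-off.

```python
import numpy as np

def diffusion_map(W, t, m):
    """Return (lam, psi, embedding) for the random walk P = D^{-1} W."""
    d = W.sum(axis=1)
    # S = D^{-1/2} W D^{-1/2} is symmetric and has the same eigenvalues as P.
    S = W / np.sqrt(np.outer(d, d))
    lam, V = np.linalg.eigh(S)
    order = np.argsort(-np.abs(lam))          # |lam_0| >= |lam_1| >= ...
    lam, V = lam[order], V[:, order]
    # Right eigenvectors of P, scaled so that psi_0 is constant (+1 or -1).
    psi = V * np.sqrt(d.sum()) / np.sqrt(d)[:, None]
    embedding = (lam[1:m + 1] ** t) * psi[:, 1:m + 1]
    return lam, psi, embedding

rng = np.random.default_rng(2)
X = rng.uniform(size=(30, 2))                 # toy data, illustrative only
sq = ((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=2)
W = np.exp(-sq / 0.1)
t = 3
lam, psi, emb = diffusion_map(W, t=t, m=W.shape[0] - 1)   # keep every term

# Diffusion distance computed directly from the t-step kernel, as in (2).
P = W / W.sum(axis=1)[:, None]
P_t = np.linalg.matrix_power(P, t)
phi0 = W.sum(axis=1) / W.sum()
x, z = 0, 7
direct = np.sum((P_t[x] - P_t[z]) ** 2 / phi0)
# Squared Euclidean distance in the embedding, as in (3).
spectral = np.sum((emb[x] - emb[z]) ** 2)
```

In practice one would pass a much smaller `m`, chosen from the decay of `lam ** t`; the full sum is used here only to make the comparison exact.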

B.  Data  merging  using  the  Laplace-Beltrami  normalization 

We now direct our attention to the case when the original data points $\Omega = \{x_1, \ldots, x_n\}$ are assumed² to
approximately lie on a submanifold $\mathcal{M}$ of $\mathbb{R}^d$. The so-called “manifold model” holds for a large variety of
situations,  such  as  when  the  data  is  produced  by  a  source  controlled  by  a  few  free  continuous  parameters. 
For  instance,  consider  the  rotation  of  a  human  head  and  the  lips  motion  of  a  speaker.  We  will  study  these 
examples  later  in  this  paper. 

On the manifold $\mathcal{M}$, the data points were sampled with a density $q(\cdot)$ that may reflect some important
aspect  of  the  phenomenon  that  generated  the  data.  For  instance,  as  described  in  [12],  for  some  data  sets, 
the  density  is  related  to  the  free  energy  surface  that  governs  the  samples.  On  the  other  hand,  the  density 
may  depend  on  the  acquisition  process  and  may  be  unrelated  to  intrinsic  geometry  or  dynamics  of  the 
underlying  phenomenon.  In  this  situation,  the  distribution  of  the  points  is  an  artifact  of  the  sampling 
process,  and  consequently,  any  “good”  representation  of  the  data  should  be  invariant  to  the  density. 

Classical  eigenmap  methods  provide  an  embedding  that  combines  the  information  of  both  the  density 
and  geometry.  For  instance,  with  the  Laplacian  eigenmaps  [2],  one  starts  by  forming  the  graph  with 
Gaussian weights $w_\varepsilon(x, y) = \exp(-\|x - y\|^2/\varepsilon)$, and then constructs the random walk as described in
the previous section. The eigenvectors are then used to embed the data set into a Euclidean space. It was
shown in [5] that in the large sample limit $n \to +\infty$ and small scale $\varepsilon \to 0$, the eigenvectors tend to
those of the Schrödinger operator $\Delta + E$, where $\Delta$ is the Laplace-Beltrami operator on $\mathcal{M}$, and E is a
scalar  potential  that  depends  on  the  density  q.  As  a  consequence,  the  Laplacian  eigenmaps  representation 
of  the  data  heavily  depends  on  the  density  of  the  data  points.  In  particular,  it  makes  it  impossible  to  fuse 
two  data  sets  obtained  from  the  same  sensors  but  with  different  densities. 

In order to solve this problem, we suggest renormalizing the Gaussian edge weights $w_\varepsilon(\cdot, \cdot)$ with an
estimate of the density and forming the random walk on this new graph. This is summarized in Algorithm
1.


² Note that the density normalization that we describe in this section can be applied to more general structures such as a cloud of points.
In  this  case,  the  diffusion  coordinates  will  be  invariant  to  the  density  of  the  points  within  this  cloud. 
Algorithm 1 Approximation of the Laplace-Beltrami diffusion

1: Start with a rotation-invariant kernel $w_\varepsilon(x, y) = h(\|x - y\|^2/\varepsilon)$.
2: Let

$$q_\varepsilon(x) = \sum_{y \in \Omega} w_\varepsilon(x, y),$$

and form the new kernel

$$\tilde{w}_\varepsilon(x, y) = \frac{w_\varepsilon(x, y)}{q_\varepsilon(x) q_\varepsilon(y)}.$$

3: Apply the normalized graph Laplacian construction to this kernel, i.e., set

$$d_\varepsilon(x) = \sum_{z \in \Omega} \tilde{w}_\varepsilon(x, z),$$

and define the anisotropic transition kernel

$$p_\varepsilon(x, y) = \frac{\tilde{w}_\varepsilon(x, y)}{d_\varepsilon(x)}.$$ (5)


Let $P_\varepsilon$ be the transition matrix with entries $p_\varepsilon(\cdot, \cdot)$. The asymptotics for $P_\varepsilon$ are given in the following
theorem.

Theorem 1: In the limit of large sample and small scales, we have

$$\lim_{\varepsilon \to 0} \lim_{n \to +\infty} \frac{I - P_\varepsilon}{\varepsilon} = \Delta.$$

In particular, the eigenvectors of $P_\varepsilon$ tend to those of the Laplace-Beltrami operator on $\mathcal{M}$. We refer to
[5] for a proof. A similar analysis for the case of a uniform density $q = 1$ is provided in [2], [22].

This  result  shows  that  the  diffusion  embedding  that  one  obtains  from  an  appropriately  renormalized 
Gaussian kernel does not depend on the density q of the data points on $\mathcal{M}$. This algorithm allows one to
successfully  capture  the  nonlinear  constraints  governing  the  data,  independently  from  the  distribution  of 
the  points.  In  other  words,  it  separates  the  geometry  of  the  manifold  from  the  density. 
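The three steps of Algorithm 1 translate almost line by line into array operations. The following is a minimal sketch, assuming a Gaussian profile for the rotation-invariant kernel; the function name `laplace_beltrami_walk`, the non-uniformly sampled circle used as toy data, and the value of the scale are illustrative assumptions.

```python
import numpy as np

def laplace_beltrami_walk(X, eps):
    """Density-normalized random walk following the steps of Algorithm 1."""
    sq = ((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=2)
    W = np.exp(-sq / eps)              # step 1: rotation-invariant kernel
    q = W.sum(axis=1)                  # step 2: density estimate q_eps
    W_tilde = W / np.outer(q, q)       #         renormalized kernel
    d = W_tilde.sum(axis=1)            # step 3: degrees of the new graph
    return W_tilde / d[:, None], d     #         transition kernel p_eps

# Toy example: a circle sampled with a strongly non-uniform density.
rng = np.random.default_rng(3)
theta = 2.0 * np.pi * rng.beta(2.0, 5.0, size=200)
X = np.column_stack([np.cos(theta), np.sin(theta)])
P_eps, d_eps = laplace_beltrami_walk(X, eps=0.05)
```

The eigenvectors of `P_eps` can then be used in place of those of the plain random walk when building the diffusion map.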

C.  Out-of-sample  extension  and  the  geometric  harmonics 

In most applications, it is essential to be able to extend the low-dimensional representation computed
on a training set to new samples. Let $\Omega$ be a data set and $\Psi_t$ be its diffusion embedding map. We now
present the geometric harmonic scheme that allows us to extend $\Psi_t$ to a new data set $\bar{\Omega}$. Since we need
to relate the new samples to the training set, we will assume that $\Omega$ is a subset of a Euclidean space $\mathbb{R}^d$.

As  mentioned  in  the  introduction,  the  Nystrom  extension  method  is  a  popular  technique  employed  in 
the  machine  learning  community  [7],  [8]  for  the  extension  of  empirical  functions  from  the  training  set  to 
new  samples.  As  we  discuss  later,  this  method  suffers  from  several  drawbacks,  and  the  scheme  that  we 
present  in  this  section  aims  at  solving  these  problems. 

For  the  sake  of  completeness,  we  first  recall  the  idea  of  Nystrom  extension  [23].  We  then  point  out  its 
weaknesses,  present  our  geometric  harmonics  extension  scheme  and  explain  how  it  solves  the  problems  of 
the Nystrom extension. Let $\sigma > 0$ be a scale of extension, and consider the eigenvectors and eigenvalues
of a Gaussian kernel³ of width $\sigma$ on the training set $\Omega$:

$$\mu_l \phi_l(x) = \sum_{y \in \Omega} e^{-\|x - y\|^2/\sigma^2} \phi_l(y) \quad \text{where } x \in \Omega.$$

3 In  order  to  simplify  our  presentation  of  the  extension  algorithm,  we  choose  to  work  with  a  Gaussian  kernel.  In  general,  one  can  use  any 
symmetric  kernel  with  an  exponential  decay. 
Since the kernel can be evaluated in the entire space, it is possible to take any $x \in \mathbb{R}^d$ in the right-hand
side of this identity. This yields the following definition of the Nystrom extension of $\phi_l$ from $\Omega$ to $\mathbb{R}^d$:

$$\bar{\phi}_l(x) = \frac{1}{\mu_l} \sum_{y \in \Omega} e^{-\|x - y\|^2/\sigma^2} \phi_l(y) \quad \text{where } x \in \mathbb{R}^d.$$ (6)

Note that $\phi_l$ is being extended to a distance proportional to $\sigma$ from the training set $\Omega$. Beyond this distance,
the extension numerically vanishes.
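Equation 6 can be illustrated with a short sketch. The helper names `gaussian_kernel` and `nystrom_extend`, the toy data and the choice of five leading modes are assumptions made for the example; only well-conditioned leading eigenvalues are used, since dividing by very small ones is numerically unsafe.

```python
import numpy as np

def gaussian_kernel(A, B, sigma):
    """Matrix of exp(-||a - b||^2 / sigma^2) for rows a of A and b of B."""
    sq = ((A[:, None, :] - B[None, :, :]) ** 2).sum(axis=2)
    return np.exp(-sq / sigma ** 2)

def nystrom_extend(X_train, X_new, sigma, n_modes):
    """Extend the leading kernel eigenvectors from X_train to X_new."""
    K = gaussian_kernel(X_train, X_train, sigma)
    mu, phi = np.linalg.eigh(K)                    # ascending eigenvalues
    mu, phi = mu[::-1][:n_modes], phi[:, ::-1][:, :n_modes]
    K_new = gaussian_kernel(X_new, X_train, sigma)
    return mu, phi, (K_new @ phi) / mu             # Equation 6, column by column

rng = np.random.default_rng(4)
X_train = rng.uniform(size=(40, 2))                # toy training set
X_new = rng.uniform(size=(5, 2))                   # toy new samples
sigma, n_modes = 0.5, 5
mu, phi, phi_ext = nystrom_extend(X_train, X_new, sigma, n_modes)
```

Evaluating the extension at the training points themselves returns the original eigenvectors, and evaluating it far away from the training set returns values that are numerically zero, as noted above.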

We now know how to extend the eigenfunctions of the kernel, and since these eigenfunctions form a
basis of the set of functions on the training set, any function f on the training set can be decomposed as
the sum

$$f(x) = \sum_l \langle \phi_l, f \rangle \phi_l(x) \quad \text{where } x \in \Omega,$$

and we can define the Nystrom extension of f to the rest of $\mathbb{R}^d$ to be

$$\bar{f}(x) = \sum_l \langle \phi_l, f \rangle \bar{\phi}_l(x) \quad \text{where } x \in \mathbb{R}^d.$$ (7)

This scheme seems very attractive, but it raises the question of the choice of the kernel of extension. In
our exposition above, we considered a Gaussian of width $\sigma$, which implies that functions will be extended
to a distance proportional to $\sigma$ (the extension numerically vanishes beyond a multiple of this distance).
Classically (see [7], [8]), when extending eigenmaps, the kernel being used for the extension is the same
as the one employed for the computation of the eigenmaps on the training set. The focal point of the
extension scheme that we now present is precisely to contradict this approach. Indeed, when computing
the diffusion embedding or any other type of Laplacian eigenmap, one strives for using as small a scale
$\sqrt{\varepsilon}$ as possible. The reason behind this is that, as shown in Theorem 1 and in [2], [22], [5], in the limit
of small scales, the diffusion maps approximate the eigenvectors of the Laplace-Beltrami operator, allowing one to
capture the geometry of the underlying structure of the data set (such as the manifold geometry if there is
an underlying manifold). On the contrary, when extending the diffusion coordinates off the training set,
it is in our interest to extend them as far as possible in order to maximize their generalization power. This
has two consequences:

• The scale $\sigma$ of the kernel used for extending should be as large as possible.

• This scale should not be the same for all functions that we are trying to extend. Indeed, we expect
the scale of extension to be related to the complexity of the function to be extended. Low-complexity
functions should be easy to extend very far from the training set. For instance the constant function
on $\Omega$ is the simplest function on the training set, and should be extendable to the entire space $\mathbb{R}^d$.
On the contrary, a function with wild variations on $\Omega$ should have a limited range of extension, as
its values off the training set are more difficult to predict.

These two observations give rise to the idea of adapting the scale of extension (and hence the kernel)
to the function f to be extended. Therefore, all we need now is a criterion for determining the maximum
scale of extension for f. To this end, fix $\sigma > 0$, and observe that in Equation 6, $\mu_l \to 0$ as $l \to +\infty$,
which implies that the Nystrom extension scheme described by Equation 7 is ill-conditioned. Of course,
we can circumvent this problem if, in the same sum, we only retain the terms corresponding to $\mu_0/\mu_l$
smaller than a given threshold $\eta > 0$:

$$\bar{f}(x) = \sum_{l : \mu_0/\mu_l < \eta} \langle \phi_l, f \rangle \bar{\phi}_l(x) \quad \text{where } x \in \mathbb{R}^d.$$ (8)

This way, the extension procedure has a condition number less than $\eta$, and this variable plays the role
of a regularization parameter. However, f and $\bar{f}$ no longer coincide on $\Omega$, which means that $\bar{f}$ is no longer
an extension of f. This is precisely the basis of decision about the scale $\sigma$: if it turns out that the difference
between f and $\bar{f}$ on $\Omega$ is still acceptable (as measured by the reconstruction error), then this means that
f is extendable at a distance $\sigma$ from $\Omega$. Otherwise, it means that $\sigma$ needs to be reduced. Indeed, if we
decrease the value of $\sigma$, then the kernel of extension becomes finer, and its eigenvalues will decay more
slowly. This allows the sum in Equation 8 to contain more terms, and $\bar{f}$ to be a better approximation of
f on $\Omega$. This geometric harmonics technique formalizes these observations into a scheme presented in
Algorithm 2.


Algorithm  2  Multiscale  extension  scheme  of  diffusion  coordinates  via  geometric  harmonics 

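1: Let $\Omega \subset \mathbb{R}^d$ be the training set and $f = \psi_i : \Omega \to \mathbb{R}$ be the diffusion coordinate to be extended
($1 \le i \le m(t)$). Choose a condition number $\eta > 0$ and an admissible error $\tau > 0$.

2: Choose an initial (large) scale of extension $\sigma = \sigma_0$.

3: Compute the eigenfunctions of the Gaussian kernel with width $\sigma$ on the training set $\Omega$:

$$\mu_l \varphi_l(x) = \sum_{y \in \Omega} e^{-\|x - y\|^2/\sigma^2} \varphi_l(y) \quad \text{where } x \in \Omega,$$

and expand $f$ on this orthonormal basis (on the training set $\Omega$):

$$f(x) = \sum_{l \ge 0} c_l \varphi_l(x) \quad \text{where } x \in \Omega.$$

4: Compute the error of reconstruction on the training set that one obtains by retaining only the
coefficients such that $\eta > \mu_0/\mu_l$ in the sum above:

$$\mathrm{Err} = \Big( \sum_{l:\, \eta \le \mu_0/\mu_l} |c_l|^2 \Big)^{1/2}.$$

If $\mathrm{Err} > \tau$ then divide $\sigma$ by 2 and go back to point 3. Otherwise continue.

5: For each $l$ such that $\eta > \mu_0/\mu_l$, extend $\varphi_l$ via the Nyström procedure:

$$\bar{\varphi}_l(x) = \frac{1}{\mu_l} \sum_{y \in \Omega} e^{-\|x - y\|^2/\sigma^2} \varphi_l(y) \quad \text{where } x \in \mathbb{R}^d,$$

and define the extension $\bar{f}$ of $f$ to be

$$\bar{f}(x) = \sum_{l:\, \eta > \mu_0/\mu_l} c_l \bar{\varphi}_l(x) \quad \text{where } x \in \mathbb{R}^d.$$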
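
The steps of Algorithm 2 can be illustrated with a short NumPy sketch. This is a minimal illustration under stated assumptions, not the implementation used for the experiments in this report: the function names, the `max_halvings` safeguard and the dense eigendecomposition are our own choices, and the error is taken as the Euclidean norm of the discarded coefficients.

```python
import numpy as np

def gaussian_kernel(X, Y, sigma):
    """Kernel matrix exp(-||x - y||^2 / sigma^2) between the rows of X and Y."""
    d2 = ((X[:, None, :] - Y[None, :, :]) ** 2).sum(axis=-1)
    return np.exp(-d2 / sigma**2)

def multiscale_extension(train, f, new_points, sigma0, eta, tau, max_halvings=30):
    """Extend f (values on `train`) to `new_points`; returns (values, scale used).

    `max_halvings` is a hypothetical safeguard that is not part of Algorithm 2.
    """
    sigma = sigma0
    for _ in range(max_halvings):
        K = gaussian_kernel(train, train, sigma)
        mu, phi = np.linalg.eigh(K)            # symmetric eigenproblem, ascending order
        mu, phi = mu[::-1], phi[:, ::-1]       # reorder so that mu[0] is the largest
        c = phi.T @ f                          # coefficients c_l = <phi_l, f>
        keep = mu > mu[0] / eta                # equivalent to mu_0 / mu_l < eta
        err = np.sqrt(np.sum(c[~keep] ** 2))   # norm of the discarded coefficients
        if err <= tau:
            # Nystrom extension of the retained eigenfunctions, then resummation.
            K_new = gaussian_kernel(new_points, train, sigma)
            phi_bar = (K_new @ phi[:, keep]) / mu[keep]
            return phi_bar @ c[keep], sigma
        sigma /= 2.0                           # finer kernel, slower spectral decay
    raise RuntimeError("no admissible scale found within max_halvings")
```

On points sampled from the unit circle, a constant function is accepted at the initial scale, whereas a function such as $\cos(8\theta)$ forces the scale to be halved several times before the reconstruction error becomes admissible, in line with the discussion of complexity versus extension range.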


To summarize our ideas, if we increase the scale of extension, then the error of reconstruction on $\Omega$
will increase. Hence, the reconstruction error limits the maximal extension range. In fact, this limitation
can be regarded as relating the complexity of the function on the training set to the distance to which it
can be extended off this set. Here, the notion of complexity is measured in terms of frequency content
on the training domain. For instance, a constant function has almost no complexity and one should be
able to extend it to the entire space. If the number of oscillations of this function increases, then the
distance to which one can extend it gets smaller. This is illustrated in Figure 1. The geometric harmonics
are therefore well suited to extending the diffusion coordinates to new samples, as higher-order
and lower-order diffusion coordinates do not have the same number of oscillations.

D.  Multi-cue  alignment  and  data  matching 

The purpose of this section is to explain how the diffusion embedding can be efficiently used for data
matching. Suppose that one has two data sets $\Omega_1 = \{x_1, \ldots, x_n\}$ and $\Omega_2 = \{y_1, \ldots, y_{n'}\}$ for which one
would like to find a correspondence, or detect similar patterns and trends, or on the contrary, underline
their dissimilarity and detect anomalies. This type of task is very common in applications related to
marketing, automatic machine translation, fraud detection or even counter-terrorism. However, working
with the data in its original form can be quite difficult as the two sets typically consist of measurements
of a very different nature. For instance, $\Omega_1$ could be a collection of measurements related to weather in a
given region, whereas $\Omega_2$ could describe agricultural production in the same region. As a consequence,
it is almost always impossible to directly compare the two data sets, simply because they might not be
represented using the same type of features. The main idea that we introduce here is that the diffusion maps
provide a canonical representation of data sets reflecting their intrinsic geometry. This new representation
is based on the graph structure of a set, that is, the neighbor relationship between points, and not on their
original feature representation. As a consequence, instead of comparing the data sets in their original
forms, it can be much more efficient to compare their embeddings. In particular, if $\Omega_1$ and $\Omega_2$ are expected
to have similar intrinsic geometric structures, then they should have similar embeddings.

Fig. 1. Extension of two functions from the unit circle to $\mathbb{R}^2$. The function on the left is very smooth on the training set, and therefore
can be extended far away from it. On the contrary, the function on the right oscillates much on the training set, and this limits its scale of
extension.

There has been a body of work related to graph-based manifold alignment. Gori et al. [24] align weighted
and unweighted graphs by computing a ‘signature’ for each node that is based on repeated use of the
invariant measure of different Markov chains defined on the data. The nodes/samples are then matched
in two ways: first, on a one-by-one basis, where nodes with similar signatures are coupled; second, in a
globally optimal approach using a bipartite graph matching scheme. Ham et al. [25] align the manifolds,
given a set of a priori corresponding nodes or landmarks. A constrained formulation of the graph Laplacian
based embeddings is derived by including the given alignment information. First, they add a term fixing the
embedding coordinates of certain samples to predefined values. Both sets are then embedded separately,
where certain samples in each set are mapped to the same embedding coordinates. Second, they describe
a dual embedding scheme, where the constrained embeddings of both sets are computed simultaneously,
and the embeddings of certain points in both datasets are constrained to be identical. The work of Bai et
al. [26] presents a framework similar to our scheme. The ISOMAP algorithm is used to embed the nodes
of the graphs corresponding to the aligned datasets in a low-dimensional Euclidean space. The nodes are
thus transformed into points in a metric space, and the graph matching is recast as the alignment of point
sets. A variant of the Scott and Longuet-Higgins algorithm is then used to find point correspondences.
An approach to many-to-many alignment was presented in [27] by Keselman et al. They aim to match
corresponding clusters of nodes in both datasets, rather than match individual nodes. The datasets are
embedded in a metric space using the Matousek embedding and sets of nodes are then aligned using the
Earth Mover’s Distance, which is a distribution-based similarity measure for sets.

In the data alignment segment of our work, we resolve the alignment of datasets with a common
low-dimensional manifold, but different densities, by incorporating the use of the density-invariant embedding.
This issue was overlooked in previous works based on spectral embeddings [24], [25], [26], [27], although
spectral and ISOMAP embeddings are highly sensitive to the way the data points were originally sampled.
Hence,  the  underlying  assumption  in  [24],  [25],  [26],  [27]  that  the  low-dimensional  embedding  of  datasets 
sharing  a  common  low-dimensional  manifold  will  be  similar,  might  prove  invalid. 

In addition to dealing with the density issue, we present a semi-supervised algorithm for finding a
one-to-one correspondence between two data sets. The scheme we introduce consists in aligning two
graphs in a nonlinear fashion, based on a finite number of landmarks (matching points or nodes). The
main idea is to lift each graph into the same diffusion space, and to align the resulting clouds of points
using a simple affine matching (see footnote 4). The diffusion maps provide a nonlinear reduction of dimensionality, and
therefore  our  scheme  is  appropriate  for  the  alignment  of  high-dimensional  data  sets  with  low-intrinsic 
dimensionality.  In  addition,  as  explained  in  the  previous  sections,  if  we  use  the  density-invariant  diffusion 
maps,  the  alignment  scheme  will  be  insensitive  to  the  different  distributions  of  points  of  the  two  data  sets. 

As for the notation, suppose that we have $k < n, n'$ landmarks in each set, that is, a sequence of $k$ pairs
$(x_{\sigma(1)}, y_{\tau(1)}), \ldots, (x_{\sigma(k)}, y_{\tau(k)})$ for which there is a known correspondence. This set of examples is the only
prior information that we use in the algorithm. We assume that $x_{\sigma(1)} \ne x_{\sigma(2)} \ne \ldots \ne x_{\sigma(k)}$. The scheme
given in Algorithm 3 computes a surjective function $g : \Omega_1 \to \Omega_2$ such that $g(x_{\sigma(1)}) = y_{\tau(1)}, \ldots, g(x_{\sigma(k)}) = y_{\tau(k)}$.


Algorithm  3  Nonlinear  graph  alignment 
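
1: Start with $k$ landmarks $(x_{\sigma(1)}, y_{\tau(1)}), \ldots, (x_{\sigma(k)}, y_{\tau(k)})$.

2: Compute the diffusion embeddings $\{\tilde{x}_1, \ldots, \tilde{x}_n\}$ and $\{\tilde{y}_1, \ldots, \tilde{y}_{n'}\}$ of $\Omega_1$ and $\Omega_2$ where, for each set,
the time parameter was chosen so that $k - 1$ eigenvectors are retained. In other words, $\tilde{x}_i$ and $\tilde{y}_j$ both
live in $\mathbb{R}^{k-1}$.

3: Compute the affine function $f : \mathbb{R}^{k-1} \to \mathbb{R}^{k-1}$ that satisfies the landmark constraints:

$$f(\tilde{x}_{\sigma(1)}) = \tilde{y}_{\tau(1)}, \ldots, f(\tilde{x}_{\sigma(k)}) = \tilde{y}_{\tau(k)}.$$

4: Define the correspondence between $\Omega_1$ and $\Omega_2$ by

$$g(x_i) = \arg\min_{y \in \Omega_2} \| f(\tilde{x}_i) - \tilde{y} \|,$$

where $x_i \in \Omega_1$.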


The idea behind the scheme presented is to embed both data sets into the (same) diffusion space, and
to use an affine alignment function $f$ in the diffusion space. We assume that the choice of the kernels for
computing the embeddings was already made by the user, and that they were selected in order to obtain
meaningful graphs with respect to the application that the user has in mind. The number of eigenvectors
used for the embedding is directly related to the number of landmarks, which in turn represents the
quantity of prior information for aligning. The larger the number of known constraints on the alignment,
the  larger  the  dimensionality  of  the  aligning  mapping.  This  is  consistent  with  the  fact  that  higher  order 
eigenvectors  capture  finer  structures.  These  observations  pave  the  way  for  a  general  sampling  theory  for 
data  sets.  Indeed,  the  landmarks  can  be  regarded  as  forming  a  subsampling  of  the  original  data  sets.  This 
subset  determines  the  largest  (or  Nyquist)  frequency  used  to  represent  the  original  set.  This  frequency  is 
measured  as  the  number  of  eigenvectors  employed. 

4 We note that the alignment procedure can be automated for low-dimensional embeddings (up to $\mathbb{R}^3$) by utilizing point matching schemes
such as ICP [28] and Geometrical Hashing [29].

Note also that the affine function that we use for aligning induces a nonlinear mapping defined on
lower-dimensional embeddings of the sets, and is even more nonlinear in the original space. It is possible
to introduce more robustness into our scheme by embedding in a lower dimension than the number of
landmarks, and looking for the best affine function that aligns the landmarks, where “best” is measured
in a least-squares sense.
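
Steps 3 and 4 of Algorithm 3, together with the least-squares variant just mentioned, can be sketched as follows. This is only an illustrative sketch: the function names are hypothetical, the diffusion embeddings are assumed to be already computed and stored as arrays with one row per point, and the affine map is fitted with NumPy's least-squares solver. When $k$ landmarks in general position are embedded in $\mathbb{R}^{k-1}$, the least-squares fit satisfies the landmark constraints exactly.

```python
import numpy as np

def fit_affine(src, dst):
    """Least-squares affine map sending the rows of src to the rows of dst.

    Returns W of shape (dim + 1, dim) acting on homogeneous coordinates [x, 1].
    """
    src_h = np.hstack([src, np.ones((len(src), 1))])
    W, *_ = np.linalg.lstsq(src_h, dst, rcond=None)
    return W

def apply_affine(W, X):
    """Apply the affine map W to every row of X."""
    return np.hstack([X, np.ones((len(X), 1))]) @ W

def align(emb1, emb2, landmarks1, landmarks2):
    """For each embedded point of set 1, return the index of its match in set 2.

    landmarks1[j] and landmarks2[j] are the indices of the j-th landmark pair.
    """
    W = fit_affine(emb1[landmarks1], emb2[landmarks2])
    mapped = apply_affine(W, emb1)
    d2 = ((mapped[:, None, :] - emb2[None, :, :]) ** 2).sum(axis=-1)
    return d2.argmin(axis=1)               # nearest embedded point of set 2
```

Using fewer embedding dimensions than landmarks makes the linear system overdetermined, so the same call to `fit_affine` then returns the best affine fit rather than an exact interpolation.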


III.  Experimental  results 

A.  Application  to  lip-reading 

The  validity  of  our  approach  is  now  demonstrated  by  applying  it  to  lip-reading  and  sequence  alignment, 
which  are  typical  high-dimensional  data  analysis  problems.  From  the  statistical  learning  point  of  view,  this 
example  allows  us  to  apply  the  ideas  presented  in  the  previous  sections  to  three  fundamental  and  related 
problems  in  the  learning  of  high-dimensional  data  in  general,  and  visual  data  in  particular.  First,  we  apply 
the  diffusion  framework  to  perform  an  efficient  nonlinear  dimensionality  reduction.  Second,  we  extend  it 
to  derive  an  intensity  invariant  embedding,  essential  for  incorporating  several  data  sources.  Finally,  we 
deal with the extension of a given embedding, computed on a given data set, to a new sample. This is
the essence of a ‘learning’ scheme that associates knowledge obtained on a training set with a new set of
samples.

Lip-reading  has  recently  gained  significant  attention  [30],  [31],  [32],  [33],  [34]  and  we  now  provide 
background and previous results in that field. The ultimate goal of lip-reading is to design human-like
man-machine interfaces allowing automatic comprehension of speech, which in the absence of sound is
denoted as lip-reading, and the synthesis of realistic lip movement. The design of such a system involves
three  main  challenges:  first,  the  feature  extraction,  which  aims  at  converting  the  images  of  the  lips  into  a 
useful  description,  must  be  achieved  with  minimal  preprocessing.  Then,  in  order  to  be  efficiently  processed, 
the  data  must  be  transformed  via  a  dimension  reduction  technique.  Last,  in  order  to  assimilate  new  data 
for  recognition,  one  must  be  able  to  perform  data  fusion. 

Previous  lip-reading  schemes  have  mainly  focused  on  the  first  two  points.  Concerning  the  feature 
extraction,  some  works  [30],  [34]  analyze  directly  the  intensity  values  of  the  input  images,  while  others 
[35],  [31]  start  by  detecting  curves  and  points  of  interest  around  the  mouth  whose  locations  are  then  used 
as  features.  The  combination  of  audio-visual  cues  was  used  in  [36]  where  the  visual  cues  are  the  extracted 
lip contours which are tracked over time. We note that combining audio-visual cues is beyond the scope of
this work and will be addressed in future work. Identifying, tracking and segmenting the lips is a difficult
task and possible solutions include active contours [37], probabilistic models [38] and the combination of
multiple visual cues (shape, color and motion) [39], to name a few. In practice, one strives to use as simple
a preprocessing scheme as possible, and in our scheme we employ a simple stabilization scheme discussed
below.

Regarding the dimensionality reduction, several schemes have been used. Preliminary work employed
linear algorithms such as the PCA and SVD subspace projections [35], [34]. For instance, Li et al. [34]
use a linear PCA scheme similar to the eigenfaces approach to face detection. Recognition is performed
by correlating an input sequence with the eigenfeatures obtained from PCA. More recent schemes [30]
utilize non-linear approaches such as the MDS [40]. Some of the techniques provide a general embedding
framework for lipreading analysis [30], while others [34], [31] concentrate on a particular task such as
phoneme  or  word  identification.  The  work  in  [41]  is  of  particular  interest,  since  it  is  one  of  the  first 
to  explicitly  formulate  the  lipreading  problem  as  a  “Manifold  Learning”  issue  and  tries  to  derive  the 
inherent  constraints  embedded  in  the  space  of  lip  configurations.  A  Hidden  Markov  Model  (HMM)  is 
used  to  model  a  small  number  of  words  (names  of  four  drinks)  which  define  the  Markov  states  and  the 
manifold.  The  HMM  is  then  used  to  recognize  the  drinks’  names  where  the  input  is  given  by  tracking 
the  outer  lips  contour  using  Active  Contours.  Utilizing  both  audio  and  visual  information  significantly 
decreased  the  error  rate,  especially  in  noisy  environments.  Kimmel  and  Aharon  [30]  applied  the  MDS 
scheme to visual lips representation, analysis and synthesis. A set of lips images is aligned and embedded
in a two-dimensional domain which is then sampled uniformly in the embedding domain to achieve
uniform  density.  The  pronunciation  of  each  word  is  defined  as  a  path  over  the  embedding  domain  and 
used  for  visual  speech  recognition,  by  path  matching.  Lips  motion  synthesis  is  derived  by  computing 
the  geodesic  path  over  the  embedding  domain,  where  the  start  and  end  point  are  given  as  input.  Anchor 
points  in  the  low-dimensional  embedding  domain  were  then  used  to  match  the  lips  configurations  of  two 
different  speakers. 

Analysis  of  lip  data  constitutes  an  application  where  it  is  important  to  separate  the  set  of  nonlinear 
constraints  on  the  data  from  the  distribution  of  the  points.  As  an  illustration  of  the  Laplace-Beltrami 
normalization  as  well  as  the  out-of-sample  extension  scheme,  we  now  describe  an  elementary  experiment 
that  paves  the  way  to  building  automatic  lip-reading  machines,  and  more  generally,  machine  learning 
systems. 

We  first  recorded  a  movie  of  the  lips  of  a  subject  reading  a  text  in  English.  The  subject  was  then 
asked  to  repeat  each  digit  “zero”,  “one”,  ...  ,  “nine”  40  times.  A  minimal  preprocessing  was  applied  to 
the recorded sequence. More precisely, it was first converted from color to gray level (values between 0
and 1). Moreover, using a marker placed on the tip of the speaker's nose during the recording, we were
able to automatically crop each frame to a rectangular area around the lips. Each of these new frames
was then regarded as a point in $\mathbb{R}^{140 \times 110}$, where $140 \times 110$ is the size of the cropped area.

The first data set, consisting of approximately 5000 frames, corresponds to the speaker reading the
text. This set was used to learn the structures of the lip motion. More precisely, we formed a graph
with Gaussian weights $\exp(-\|x_i - x_j\|^2/\varepsilon)$ on the edges between all pairs of points, where the distance
$\|x_i - x_j\|$ was merely calculated as the Euclidean $L^2$ distance between frames $i$ and $j$. The scale $\varepsilon > 0$ was
chosen by looking at the distribution of the distances from each point to the other points. We selected $\sqrt{\varepsilon}$
such that each data point would be numerically connected with at least one other point in the graph. This
value, which was found to be equal to 1000, turned out to make the graph of the data totally connected.
The choice of this number was also coherent with the shape of the distribution of the distances (see Figure
2) in that, on average, each point is connected to a small fraction of the other points.

Fig. 2. The distribution of distances between all pairs of data points. The choice of the scale $\sqrt{\varepsilon} = 1000$ corresponds to having each data
point connected to at least one other data point. The resulting graph happened to be totally connected. This histogram shows that the choice
of this scale parameter leads to a sparse graph: each node is connected, on average, to a small number of other nodes.
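
The scale selection rule can be sketched as follows. This is a minimal sketch under an assumption of ours: "numerically connected" is interpreted as having at least one neighbor within distance $\sqrt{\varepsilon}$, so that the corresponding weight is at least $e^{-1}$. The function names are hypothetical.

```python
import numpy as np

def pairwise_distances(X):
    """Euclidean distances between all rows of X (one data point per row)."""
    sq = (X ** 2).sum(axis=1)
    d2 = sq[:, None] + sq[None, :] - 2.0 * X @ X.T
    return np.sqrt(np.maximum(d2, 0.0))    # clip tiny negatives caused by rounding

def connectivity_scale(X):
    """Smallest sqrt(eps) such that every point has a neighbor within sqrt(eps)."""
    D = pairwise_distances(X)
    np.fill_diagonal(D, np.inf)            # ignore the distance of a point to itself
    return D.min(axis=1).max()             # largest nearest-neighbor distance

def gaussian_weights(X, sqrt_eps):
    """Graph weights exp(-||x_i - x_j||^2 / eps) with eps = sqrt_eps ** 2."""
    D = pairwise_distances(X)
    return np.exp(-D ** 2 / sqrt_eps ** 2)
```

For the frames used here, `X` would hold one flattened gray-level image per row; a histogram of `pairwise_distances(X)` gives the kind of distribution shown in Figure 2.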


We then renormalized the Gaussian weights using the Laplace-Beltrami normalization described in
Section II-B. By doing so, our analysis focused on viewing the mouth as a constrained mechanical
system. In order to obtain a low-dimensional parametrization of these nonlinear constraints, we computed
the diffusion coordinates on this new graph. The spectrum of the diffusion matrix is plotted in Figure 3
and the embedding in the first 3 eigenfunctions is shown in Figure 4.


Fig.  3.  The  top  100  eigenvalues  of  the  diffusion  matrix  for  the  lips  data.  The  spectrum  decays  rapidly. 


Fig.  4.  The  embedding  of  the  lip  data  into  the  top  3  diffusion  coordinates.  These  coordinates  essentially  capture  two  parameters:  one 
controlling  the  opening  of  the  mouth  and  the  other  measuring  the  portion  of  teeth  that  are  visible. 
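
The computation of density-normalized diffusion coordinates from a weight matrix can be sketched as follows. Section II-B is not reproduced here, so this sketch assumes the standard density normalization from the diffusion maps literature (divide the weights by the product of the row sums, then row-normalize into a Markov matrix); the function name and the use of a dense symmetric eigensolver are our own choices.

```python
import numpy as np

def diffusion_coordinates(W, n_coords, t=1):
    """Density-normalized diffusion coordinates from a symmetric weight matrix W.

    Returns (coords, eigenvalues): coords has one row per data point, and the
    eigenvalues are sorted in decreasing order (the first one equals 1).
    """
    q = W.sum(axis=1)                      # density estimate at each point
    W1 = W / np.outer(q, q)                # remove the influence of the density
    d = W1.sum(axis=1)
    S = W1 / np.sqrt(np.outer(d, d))       # symmetric conjugate of the Markov matrix
    lam, v = np.linalg.eigh(S)
    lam, v = lam[::-1], v[:, ::-1]         # decreasing order
    psi = v / v[:, [0]]                    # right eigenvectors, scaled so psi_0 = 1
    coords = lam[1:n_coords + 1] ** t * psi[:, 1:n_coords + 1]
    return coords, lam
```

The trivial constant eigenvector is skipped, and the decay of the returned eigenvalues corresponds to the kind of spectrum plotted in Figure 3.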


The task we wanted to perform was isolated-word recognition on a small vocabulary. The example
that we considered was that of identification of digits. Each word “zero”, “one”, ..., “nine” is typically
a sequence of 25 to 40 frames that we need to project into the diffusion space (see footnote 5). In order to do so, we used
the geometric harmonic extension scheme presented in Section II-C to extend each diffusion coordinate
to the frames corresponding to the subject pronouncing the different digits. After this projection, each
word can be viewed as a trajectory in the diffusion space. The word recognition problem now amounts
to identifying trajectories in the diffusion space.

5 Note that this second data set was not used to compute the diffusion maps.

We can now build a classifier based on comparing a new trajectory to a collection of labeled trajectories
in a training set. We randomly selected 20 instances of each digit to form a training set, the remaining
20 being used as a testing set. In order to compare trajectories in the diffusion space, a metric is needed,
and we chose to use the Hausdorff distance between two sets $\Gamma_1$ and $\Gamma_2$, defined as

$$d_H(\Gamma_1, \Gamma_2) = \max\Big\{ \max_{x_2 \in \Gamma_2} \min_{x_1 \in \Gamma_1} \|x_1 - x_2\|, \ \max_{x_1 \in \Gamma_1} \min_{x_2 \in \Gamma_2} \|x_1 - x_2\| \Big\}.$$

Although this distance does not use the temporal information, it has the advantage of not being sensitive to
the choice of a parametrization or to the sampling density for either set $\Gamma_1$ or $\Gamma_2$. For a given trajectory
$\Gamma$ from the testing set, our classifier is a nearest-neighbor classifier for this metric, i.e., the class of $\Gamma$ is
decided to be that of the nearest trajectory (for $d_H$) in the training set. The performance of this classifier
averaged over 100 random trials is shown in Table I. In this case, the data set was embedded in 15
dimensions.
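
The Hausdorff distance and the nearest-neighbor rule described above can be sketched in a few lines. The function names are hypothetical, and each trajectory is assumed to be stored as an array with one embedded frame per row.

```python
import numpy as np

def hausdorff(A, B):
    """Hausdorff distance between two point sets given as arrays of rows."""
    D = np.linalg.norm(A[:, None, :] - B[None, :, :], axis=-1)   # D[i, j] = ||a_i - b_j||
    # max over B of the distance to A, and max over A of the distance to B
    return max(D.min(axis=0).max(), D.min(axis=1).max())

def classify(trajectory, train_trajectories, train_labels):
    """Label of the training trajectory closest to `trajectory` for d_H."""
    dists = [hausdorff(trajectory, T) for T in train_trajectories]
    return train_labels[int(np.argmin(dists))]
```

Because only the sets of visited points enter the computation, two trajectories with different numbers of frames can be compared directly, which matters here since words span 25 to 40 frames.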

TABLE I

Classifier performance over 100 random trials. Each row corresponds to the classification distribution of a given
digit over the 10 classes. The data set was embedded in 15 dimensions.


The  classification  error  ranges  from  0%  to  31%  with  an  average  of  12.2%.  The  best  classification  rate  is 
achieved  for  the  word  “one”  which,  in  terms  of  visual  information,  stands  far  away  from  the  other  digits. 
In  particular,  typical  sequences  of  “one”  involve  frames  with  a  round  open  mouth,  with  no  teeth  visible 
(see  first  row  of  Figure  5).  These  frames  essentially  never  appear  for  other  digits.  The  worst  classification 
result is for the word “seven”, which seems to be highly confused with the words “five” and “six”. As shown
in Figure 5, typical instances of these words appear to be similar in that the central frames involve an
open mouth with visible teeth. In the case of “six” and “seven”, teeth from the lower jaw are visible
because of the “s” sound. Regarding the similarity between “five” and “seven”, the “f” and “v” sounds
translate into the lower lip touching the teeth of the upper jaw.

The  accuracy  that  we  obtain  is  comparable  to  former  schemes  [30],  [41],  while  using  significantly 
less  preprocessing.  For  instance,  in  [30],  the  lips  images  are  hand  picked  and  stabilized  using  an  affine 
motion model, while in [41] the contours of the lips are tracked by Active Contours. Our lips images are
acquired by taping a continuous 5-minute sequence, and a simple cropping is performed to compensate
for translations. We note that the above comparison is qualitative rather than quantitative, as the different
schemes  were  applied  to  different  datasets  that  are  not  publicly  available. 


B.  Synchronization  of  head  movement  data 

Fig. 5. Typical frames for the words “one”, “five”, “six”, “seven”.

We now illustrate the concept of graph alignment as well as the algorithm presented in Section II-D.
We recorded 3 movies of subjects wearing successively a yellow, red and black mask. Each subject was
asked to move their head in front of the camcorder. We then considered the three sets consisting of all
frames of each movie. Let YELLOW, RED and BLACK denote these sets. Our goal was to synchronize
the movements of the different masks by aligning the 3 diffusion embeddings. The objective of this
experiment was twofold:

•  We  first  wanted  to  illustrate  the  importance  of  having  a  coordinate  system  capturing  the  intrinsic 
geometry  of  data  sets.  The  intrinsic  geometry  is  the  basis  of  our  alignment  scheme:  the  key  point 
is  that,  as  we  will  show,  all  three  sets  exhibit  approximately  the  same  intrinsic  geometry,  and  that 
the diffusion coordinates parameterize this geometry. It is to be noted that working directly in image
space would be highly inefficient since any picture of the red or black mask is at a large distance from
the set of pictures of the yellow mask (this is a direct consequence of the high dimensionality of the
data). On the contrary, the diffusion coordinates will capture the intrinsic organization of each data
set, and therefore will provide a canonical representation of the sets that can be used for matching
the data. Note also that our approach does not require any prior information on the type of data we
are dealing with.

•  The  other  point  that  we  wished  to  illustrate  is  the  importance  of  using  the  density-invariant  diffusion 
maps. As we will show, although the three sets have approximately the same intrinsic geometry (the
data points lie on the same 2D submanifold), the distributions of the points on this manifold are quite
different. Therefore, it is necessary to employ the density renormalization technique described in
Section II-B.

These  two  points  constitute  the  main  ingredients  for  a  successful  alignment  of  the  sets. 

We now describe the experiment in more detail. Each set of frames was regarded as a collection of
points in $\mathbb{R}^{10000}$, where the dimensionality coincides with the number of pixels per image. Following the
lines of our algorithm, we formed a graph from each set with Gaussian weights $\exp(-\|x_i - x_j\|^2/\varepsilon)$.
The quantity $\|x_i - x_j\|$ represents the $L^2$ norm between images $i$ and $j$, and here again, the scale was
chosen so that each data point would be numerically connected to at least one other data point. We expect
each set to lie approximately on a manifold of dimension 2, as each subject essentially moved their head
along two angles $\alpha$ and $\beta$ shown in Figure 6 and as the light conditions were kept the same during the
recording. Therefore, each data set is the expression of a highly constrained mechanical system, namely
the articulation between the neck and the head.

It  is  clear  that  the  density  of  points  on  this  manifold  is  essentially  arbitrary  and  varies  with  each  subject 
and  recording.  Indeed,  the  density  is  essentially  a  function  of  the  type  of  movement  of  each  subject,  their 
speed  of  execution,  and  also  the  type  of  mask  that  they  were  wearing.  Since  we  were  only  interested 
in  the  space  of  constraints,  that  is  the  geometry  of  the  manifold,  we  renormalized  the  Gaussian  weights 
according  to  the  algorithm  described  in  Section  II-B,  and  constructed  a  Markov  chain  that  approximates 
the Laplace-Beltrami diffusion. Figure 7 shows the embedding in the first three eigenfunctions for each
data set. They are extremely similar. We then defined 8 matching triplets of landmarks (one landmark in
each set). The landmarks were chosen to correspond to the main head positions. We computed the diffusion
embedding in 7 dimensions and we then calculated two affine functions $g_{YR} : \mathbb{R}^7 \to \mathbb{R}^7$ and
$g_{YB} : \mathbb{R}^7 \to \mathbb{R}^7$ that match the landmarks from YELLOW to RED, and from YELLOW to BLACK, respectively.

Fig. 6. Each subject essentially moved their head along the two angles $\alpha$ and $\beta$. There was almost no tilting of the head. Hence, the data
points approximately lie on a submanifold of dimension 2.

Fig. 7. The embedding of each set in the first 3 diffusion coordinates. The color encodes the density of points. All three sets share this
butterfly-shaped embedding.

Two  conclusions  can  be  drawn  from  this  experiment.  First,  the  diffusion  embedding  revealed  that  the 
data  sets  were  approximately  2-dimensional,  as  expected  (see  Figure  7  for  the  embeddings  in  the  first  3 
diffusion coordinates). The diffusion coordinates captured the main parameters of variability, namely the
angles $\alpha$ and $\beta$. From the embedding plots, it can be seen that all three embedded sets have strikingly
similar shapes. This supports our intuition that all sets should have similar intrinsic geometries. From
this observation, we were able to successfully compute two aligning functions $g_{YB}$ and $g_{YR}$, and we used
them to drive the movements of the black and red masks from those of the yellow mask. The result of
the matching of the three data sets is shown in Figure 8. A live demo of this experiment can be found
at [42].

The other conclusion concerns the importance of using the density-normalized diffusion coordinates. A key point in our analysis is that, to compare the intrinsic geometries of the sets, we need to be able to remove the influence of the density of the points on the 2D submanifold. To underline the importance of this idea, we also computed the embeddings of the YELLOW and BLACK sets without this renormalization. According to the discussion of Section II-B, the embedded sets should now reflect both the constraints (the intrinsic geometry) and the distribution of the points (the density on the submanifold). The result is shown in Figure 9: although the embedding of the BLACK set still retains the butterfly shape that we previously obtained when renormalizing, the YELLOW set is now embedded as some portion of an ovoid. Although this statement may seem very qualitative, it is now clear that the alignment of these sets should fail. This experiment therefore underlines the importance of being able to compute density-invariant embeddings of the data.
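The effect of the renormalization can be explored with a small sketch. It rests on assumptions not confirmed by this section: that the weights are Gaussian, $w(x,y) = \exp(-\|x-y\|^2/\varepsilon)$, and that the renormalization of Section II-B divides the kernel by the product of the kernel density estimates at both points. The function name `diffusion_embedding` and the sampled circle are hypothetical.

```python
import numpy as np

def diffusion_embedding(points, eps, density_normalize=True):
    """Eigenvalues and right eigenvectors of the diffusion Markov matrix."""
    sq = np.sum((points[:, None, :] - points[None, :, :]) ** 2, axis=-1)
    W = np.exp(-sq / eps)                    # assumed Gaussian weights
    if density_normalize:
        q = W.sum(axis=1)                    # kernel density estimate per point
        W = W / np.outer(q, q)               # assumed renormalization step
    d = W.sum(axis=1)
    A = W / np.sqrt(np.outer(d, d))          # symmetric conjugate of P = D^{-1} W
    vals, vecs = np.linalg.eigh(A)
    order = np.argsort(-vals)                # decreasing eigenvalues
    vals, vecs = vals[order], vecs[:, order]
    psi = vecs / vecs[:, [0]]                # right eigenvectors of P; psi_0 = 1
    return vals, psi

# A circle sampled with a strongly non-uniform density.
rng = np.random.default_rng(0)
t = 2 * np.pi * rng.uniform(size=150) ** 2
pts = np.column_stack([np.cos(t), np.sin(t)])

vals, psi = diffusion_embedding(pts, eps=0.5)
coords = psi[:, 1:4]                         # first three nontrivial coordinates
```

Calling the function with `density_normalize=False` on the same points gives the embedding without renormalization, which can be compared visually with `coords`.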

Fig. 8. The embedding of the YELLOW set in three diffusion coordinates and the various corresponding images after alignment of the RED and BLACK graphs to YELLOW.


Fig.  9.  The  embeddings  of  the  YELLOW  (a)  and  BLACK  (b)  sets  in  three  diffusion  coordinates  without  the  density  renormalization.  These 
embedded  sets  now  have  very  different  shapes,  and  their  alignment  is  impossible. 


IV.  Conclusion  and  future  work 

In  this  work  we  introduced  diffusion  techniques  as  a  framework  for  data  fusion  and  multi-cue  data 
matching by addressing several key issues. First, we underlined the importance of the Laplace-Beltrami normalization for data fusion by showing that it allows data sets produced by the same source but with different densities to be merged. In particular, the Laplace-Beltrami embedding provides a canonical, density-invariant embedding, which is essential for data matching. Second, we suggested a new data fusion scheme,
by  extending  spectral  embeddings  using  the  geometric  harmonics  framework.  Finally,  we  presented  a  novel 
spectral  graph  alignment  approach  to  data  fusion. 

Our  scheme  was  successfully  applied  to  lip-reading  where  we  achieved  high  accuracy  with  minimal 
preprocessing.  We  also  demonstrated  the  alignment  of  high-dimensional  visual  data  (“rotating  heads” 
sequence). 

In  the  work  presented,  we  have  focused  on  the  situation  when  all  sources  are  highly  correlated.  In 
the  future  we  plan  on  extending  our  approach  to  multi-cue  data  analysis  by  integrating  different  signals 
from  weakly  correlated  sources  into  a  unified  representation.  This  should  open  the  door  to  applications 
related to multi-sensor integration. Finally, we are also studying a spectral-based approach to the analysis of signals as dynamical random processes. Our current work did not utilize the temporal information of the video sequences. By constructing a dynamical Markov process model, we intend to improve the lip-reading accuracy.


Appendix I

Existence  and  uniqueness  of  the  stationary  distribution 

The goal of this section is to show that if the graph is connected, then the stationary distribution $\phi_0$ is guaranteed to exist and to be unique. The first step is to notice that the data set is finite, and therefore so is the state space of our Markov chain. Thus, by a classical version of the Perron-Frobenius theorem, it suffices to prove that the chain is irreducible and aperiodic.

• The irreducibility is a mere consequence of the fact that the graph is connected. Indeed, let $x_i$ and $x_j$ be two data points, and let $r$ be the length of a path connecting $x_i$ and $x_j$. Since the graph is connected, we know that $r < +\infty$. We conclude that $p_r(x_i, x_j) > 0$, which implies that the chain is irreducible.

• Concerning the aperiodicity, remember that $w(\cdot, \cdot)$ represents the similarity between data points, so we can assume that for every data point $x_i$ we have $w(x_i, x_i) > 0$. Consequently, $p_1(x_i, x_i) > 0$, which implies that the chain is aperiodic.

Finally, we can conclude that our Markov chain has a unique stationary distribution $\phi_0$.
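The argument above can be checked numerically on a toy graph. The weight matrix below is a made-up example (symmetric, connected, positive diagonal). For a finite chain, irreducibility together with aperiodicity is equivalent to some power of $P$ being entrywise positive, and Wielandt's bound says the power $(n-1)^2 + 1$ suffices.

```python
import numpy as np

# Hypothetical similarity matrix w(x, y): symmetric, connected graph, w(x, x) > 0.
W = np.array([[1.0, 0.5, 0.0],
              [0.5, 1.0, 0.2],
              [0.0, 0.2, 1.0]])
d = W.sum(axis=1)
P = W / d[:, None]                    # p_1(x, y) = w(x, y) / d(x)

n = len(W)
# Irreducible and aperiodic <=> P^((n-1)^2 + 1) has only positive entries.
is_primitive = bool(np.all(np.linalg.matrix_power(P, (n - 1) ** 2 + 1) > 0))

phi0 = d / d.sum()                    # candidate stationary distribution
P_limit = np.linalg.matrix_power(P, 200)
```

Every row of `P_limit` approaches `phi0`, which is the practical meaning of uniqueness: the chain forgets its starting point.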


Appendix  II 

Diffusion  distance  and  eigenfunctions 

The random walk constructed from a graph via the normalized graph Laplacian procedure yields a Markov matrix $P$ with entries $p_1(x,y)$. As is well known [15], this matrix is in fact conjugate to a symmetric matrix $A$ with entries $a(x,y)$, given by

$$a(x,y) = \sqrt{\frac{d(x)}{d(y)}}\; p_1(x,y) = \frac{w(x,y)}{\sqrt{d(x)\,d(y)}}.$$


Therefore $A$ has $n$ eigenvalues $\lambda_0, \dots, \lambda_{n-1}$ and orthonormal eigenvectors $v_0, \dots, v_{n-1}$. In particular,

$$a(x,y) = \sum_{l=0}^{n-1} \lambda_l\, v_l(x)\, v_l(y). \tag{9}$$


This implies that $P$ has the same $n$ eigenvalues. In addition, it has $n$ left eigenvectors $\phi_0, \dots, \phi_{n-1}$ and $n$ right eigenvectors $\psi_0, \dots, \psi_{n-1}$. Also, it can be checked that

$$\phi_l(y) = v_l(y)\, v_0(y) \quad \text{and} \quad \psi_l(x) = v_l(x) / v_0(x). \tag{10}$$

Furthermore, it can be verified that $v_0(x) = \sqrt{d(x)} \big/ \sqrt{\sum_z d(z)}$, and therefore $\phi_0(y) = d(y) \big/ \sum_z d(z)$ and $\psi_0(x) = 1$. In addition,

$$\phi_0(x)\, \psi_l(x) = \phi_l(x). \tag{11}$$

It results from Equations 9 and 10 that $P^t$ admits the following spectral decomposition:

$$p_t(x,y) = \sum_{l=0}^{n-1} \lambda_l^t\, \psi_l(x)\, \phi_l(y), \tag{12}$$

together with the biorthogonality relation

$$\sum_{y \in \Omega} \phi_i(y)\, \psi_j(y) = \delta_{ij}, \tag{13}$$

where $\delta_{ij}$ is the Kronecker symbol. Combining this last identity with Equation 11, one obtains

$$\sum_{y \in \Omega} \frac{\phi_i(y)\, \phi_j(y)}{\phi_0(y)} = \delta_{ij}.$$


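This means that the system $\{\phi_l\}$ is orthonormal in $L^2(\Omega, 1/\phi_0)$. Therefore, if one fixes $x$, Equation 12 can be interpreted as the decomposition of the function $p_t(x, \cdot)$ over this system, where the coefficients of decomposition are $\{\lambda_l^t \psi_l(x)\}$.

Now by definition,

$$D_t(x,z)^2 = \sum_{y \in \Omega} \frac{\big(p_t(x,y) - p_t(z,y)\big)^2}{\phi_0(y)} = \| p_t(x,\cdot) - p_t(z,\cdot) \|^2_{L^2(\Omega, 1/\phi_0)}.$$

Therefore,

$$D_t(x,z)^2 = \sum_{l=0}^{n-1} \lambda_l^{2t} \big(\psi_l(x) - \psi_l(z)\big)^2.$$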
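The identity derived in this appendix is easy to verify numerically. The sketch below uses a made-up point cloud and a Gaussian kernel (both assumptions made only for the test) and compares the diffusion distance computed from its definition with the spectral formula.

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(30, 2))                     # hypothetical data points
sq = ((X[:, None] - X[None]) ** 2).sum(-1)
W = np.exp(-sq)                                  # symmetric weights w(x, y)
d = W.sum(axis=1)
P = W / d[:, None]                               # Markov matrix, entries p_1(x, y)
phi0 = d / d.sum()                               # stationary distribution

A = W / np.sqrt(np.outer(d, d))                  # symmetric conjugate of P
lam, V = np.linalg.eigh(A)                       # eigenvalues and orthonormal v_l
psi = V / np.sqrt(phi0)[:, None]                 # psi_l(x) = v_l(x) / v_0(x)

t = 3
Pt = np.linalg.matrix_power(P, t)                # entries p_t(x, y)

def dist_direct(x, z):
    """D_t(x, z)^2 from the weighted L2 definition."""
    return np.sum((Pt[x] - Pt[z]) ** 2 / phi0)

def dist_spectral(x, z):
    """D_t(x, z)^2 from the eigenvalues and right eigenvectors."""
    return np.sum(lam ** (2 * t) * (psi[x] - psi[z]) ** 2)
```

The sign ambiguity of each eigenvector returned by `eigh` is harmless here because only squared differences of `psi` enter the formula.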


Abstract

We  introduce  Greedy  Basis  Pursuit  (GBP),  a  new  algorithm  for  computing  sparse  signal 
representations  using  overcomplete  dictionaries.  GBP  is  rooted  in  computational  geometry  and 
exploits an equivalence between minimizing the $\ell^1$-norm of the representation coefficients and determining the intersection of the signal with the convex hull of the dictionary. GBP unifies the different advantages of previous algorithms: like standard approaches to Basis Pursuit, GBP computes representations that have minimum $\ell^1$-norm; like greedy algorithms such as
Matching  Pursuit,  GBP  builds  up  representations,  sequentially  selecting  atoms.  We  describe 
the  algorithm,  demonstrate  its  performance,  and  provide  code.  Experiments  show  that  GBP 
can  provide  a  fast  alternative  to  standard  linear  programming  approaches  to  Basis  Pursuit. 


1  Introduction 

The  problem  of  computing  sparse  signal  representations  using  an  overcomplete  dictionary  arises  in 
a  wide  range  of  signal  processing  applications  [87,  34,  55],  including  image  [10,  105],  audio  [68,  43], 
and  video  [6]  compression  and  source  localization  [71].  The  goal  is  to  represent  a  given  signal  as 
a linear superposition of a small number of stored signals, called atoms, drawn from a larger set, called the dictionary. In traditional signal representation methods, such as the DCT or various wavelet transforms, the dictionary is simply a basis: the number of atoms in the dictionary is equal to the dimensionality of the signal space and the representation is unique. By contrast, in an overcomplete dictionary the number of atoms is greater than the dimensionality of the signal space and the representation is no longer unique; this enables flexibility in representation [72], ‘shiftability’ [93], and the use of multiple bases [62, 97], but it requires a criterion to select from among the (many) possible representations. A natural one is sparsity, by which the representation selected is the one that uses as few atoms as possible.

Computing  sparse  representations  is  NP-hard  [78,  31],  and  so  several  (heuristic)  methods  have 
been  developed  [72,  83,  19,  56].  These  methods  optimize  various  measures  of  sparsity,  typically 
functions  of  the  representation  coefficients  [66,  65],  using,  for  example,  greedy  algorithms  [72], 
gradient  descent  [69],  linear  programming  [21],  and  global  optimization  [86].  Currently,  the  two 
most  popular  approaches  are  Matching  Pursuit  [72]  and  Basis  Pursuit  [20,  21]. 

Matching  Pursuit  (MP)  is  a  greedy  algorithm:  a  signal  representation  is  iteratively  built  up 
by  selecting  the  atom  that  maximally  improves  the  representation  at  each  iteration.  While  there 
is  no  guarantee  that  MP  computes  sparse  representations,  MP  is  easily  implemented,  converges 
quickly,  and  has  good  approximation  properties  [72,  100,  58].  Moreover,  MP  and  one  of  its  variants, 
Orthogonal  Matching  Pursuit  (OMP)  [83],  can  be  shown  to  compute  sparse  (or  nearly  sparse) 
representations  under  some  conditions  [102,  58]. 

Basis Pursuit (BP), instead of seeking sparse representations directly, seeks representations that minimize the $\ell^1$-norm of the coefficients. By equating signal representation with $\ell^1$-norm minimization, BP reduces signal representation to linear programming [20, 21], which can be solved
by  standard  methods  [104].  Furthermore,  BP  methods  can  compute  sparse  solutions  in  situations 
where  greedy  algorithms  fail  [21].  Recent  theoretical  work  shows  that  representations  computed  by 
BP  are  guaranteed  to  be  sparse  under  certain  conditions  [37,  36,  51,  103]. 

While applying standard linear programming methods to compute minimum $\ell^1$-norm signal representations is natural, such methods were developed with very different problems in mind and may not be ideally suited to the representation problem. For example, if the matrix corresponding to the dictionary is not sparse then the (normally fast) interior point methods advocated for BP [21] can be slow. Furthermore, the design required to produce examples on which greedy algorithms fail yet BP succeeds suggests that a greedy strategy could be successfully applied to minimum $\ell^1$-norm representation.

In  this  article  we  develop  a  new  algorithm  for  computing  sparse  signal  representations,  which 
we call Greedy Basis Pursuit (GBP). GBP is an algorithm for BP: it minimizes the $\ell^1$-norm of the
representation  coefficients.  However,  unlike  standard  linear  programming  methods  for  BP,  GBP 
proceeds  much  like  MP,  building  up  the  representation  by  iteratively  selecting  atoms. 

While  algorithmically  similar  to  MP,  GBP  differs  from  MP  in  two  key  ways:  (1)  GBP  uses 
a  novel  criterion  for  selecting  the  next  atom  in  the  representation.  The  criterion  is  based  on 
computational  geometry,  and  effects  a  search  for  the  intersection  between  the  signal  vector  and 
the  convex  hull  of  the  dictionary.  (2)  GBP  may  discard  atoms  that  it  has  already  selected;  this 
is  crucial,  as  it  allows  GBP  to  overcome  the  ‘mistakes’  that  MP  can  make  in  atom  selection  when 
compared  to  BP  [21]. 

While GBP returns the signal representation with the minimum $\ell^1$-norm, and thus GBP enjoys
the  theoretical  benefits  of  BP,  the  greedy  strategy  of  GBP  leads  to  computational  gains  when 
compared  to  standard  linear  programming  methods.  Experiments  show  our  implementation  of 
GBP  to  be  faster  than  off-the-shelf  linear  programming  packages  on  some  signal  representation 
problems,  particularly  high-dimensional  problems  with  very  overcomplete  dictionaries. 

The  remainder  of  this  paper  is  organized  as  follows.  In  Section  1.1  we  formally  state  the  sparse 
signal  representation  problem.  In  Section  2  we  review  current  approaches  to  the  problem.  Section  3 
provides  the  geometric  interpretation  of  Basis  Pursuit  that  underlies  GBP.  In  Section  4  we  describe 
the Greedy Basis Pursuit algorithm. Section 5 presents the results of experiments with GBP. We
discuss  GBP  in  Section  6  and  conclude  in  Section  7. 

1.1  Problem  Statement 

Given a signal $x$ and a dictionary $\mathcal{D}$ we seek a sparse representation of $x$. We assume that $x$ consists of $d$ real-valued measurements, that is, $x \in \mathbb{R}^d$; for example, a sound wave sampled at $d$ points. We assume that $\mathcal{D}$ consists of $n$ atoms and is overcomplete, that is, $\mathcal{D} = \{\psi_i\}_{i=1}^{n}$ and $n > d$, and that the atoms are also $d$-dimensional and have unit norm, that is, $\psi_i \in \mathbb{R}^d$ and $\|\psi_i\|_2 = 1$. A representation of $x$ is a set of indices $I$ into $\mathcal{D}$, where $I \subset \{1, \dots, n\}$, and a corresponding set of coefficients $A = \{\alpha_i\}_{i \in I}$ such that

$$x = \sum_{i \in I} \alpha_i \psi_i \tag{1}$$

A representation is sparse if the number of atoms used, $|I|$ (here $|\cdot|$ denotes cardinality), is minimized over all possible representations.

Equivalently, in matrix notation, given a (column) vector $x \in \mathbb{R}^d$ corresponding to the signal, and a $d \times n$ matrix $D$ corresponding to the dictionary, where the $i$th column of $D$ is the atom $\psi_i$, the sparse signal representation problem is then to compute a (column) vector $\alpha \in \mathbb{R}^n$ solving

$$\text{Minimize } \|\alpha\|_0 \text{ subject to } D\alpha = x \tag{2}$$

where $\|\alpha\|_0$ is the $\ell^0$-norm of $\alpha$, defined to be the number of nonzero entries of $\alpha$. In general, the equality constraint can be relaxed to give a corresponding approximation problem; see [78, 100, 102, 103].

BP replaces the $\ell^0$-norm with the $\ell^1$-norm, seeking representations that minimize $\sum_{i \in I} |\alpha_i|$. In matrix form this corresponds to

$$\text{Minimize } \|\alpha\|_1 \text{ subject to } D\alpha = x \tag{3}$$

GBP solves (3). The approximation problem corresponding to BP is called Basis Pursuit Denoising; see [21, 69, 42].

2  Related  work 

The  design  of  GBP  draws  on  previous  work  in  sparse  signal  representation,  particularly  the  contrast 
between MP and BP, and on ideas from subset selection, which we summarize here. We also highlight some unexplored connections between sparse signal representation and linear programming.

2.1  Matching  Pursuit 

Matching Pursuit (MP) [72] is the prototypical greedy algorithm [23] applied to sparse signal representation. MP is currently the most popular algorithm for computing sparse signal representations using an overcomplete dictionary, and is used in a variety of applications [10, 84, 6]. MP has also spawned several variants [46, 63, 47], including Orthogonal Matching Pursuit (OMP) [83, 32], which itself has several variants [54, 24, 88].

MP computes a signal representation by greedily constructing a sequence of approximations to the signal, $x^{(0)}, x^{(1)}, x^{(2)}, \dots$, where each consecutive approximation is closer to the signal. MP begins with an ‘empty’ representation, $x^{(0)} = 0$, and at each iteration augments the current representation by selecting the atom from the dictionary which is closest to the residual, $x^{(t+1)} = x^{(t)} + \alpha^{(t)} \psi^{(t)}$, where $\psi^{(t)} = \arg\max_{\psi_i \in \mathcal{D}} \langle \psi_i, x - x^{(t)} \rangle$ and $\alpha^{(t)} = \langle \psi^{(t)}, x - x^{(t)} \rangle$.
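The MP iteration just described fits in a few lines. This is a minimal sketch, not the reference implementation of [72]; the function name, the stopping rule, and the random dictionary are assumptions. The absolute value in the selection step handles dictionaries that do not explicitly contain the negative of each atom.

```python
import numpy as np

def matching_pursuit(D, x, n_iter=50, tol=1e-10):
    """Greedy MP. D: (d, n) matrix with unit-norm columns; x: (d,) signal."""
    alpha = np.zeros(D.shape[1])
    residual = np.array(x, dtype=float)
    for _ in range(n_iter):
        corr = D.T @ residual               # inner products <psi_i, x - x^(t)>
        k = int(np.argmax(np.abs(corr)))    # atom best matching the residual
        alpha[k] += corr[k]                 # coefficient update
        residual -= corr[k] * D[:, k]       # new residual x - x^(t+1)
        if np.linalg.norm(residual) < tol:
            break
    return alpha, residual

rng = np.random.default_rng(0)
D = rng.normal(size=(5, 20))
D /= np.linalg.norm(D, axis=0)              # normalize atoms to unit norm
x = rng.normal(size=5)

alpha, residual = matching_pursuit(D, x)
```

Because an atom may be selected more than once, the coefficient is accumulated rather than overwritten.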

MP  is  easy  to  implement,  has  a  guaranteed  exponential  rate  of  convergence  [72,  100,  58],  and 
recovers  relatively  sparse  solutions  [102],  particularly  compared  to  earlier  approaches  such  as  the 
Method-of-Frames [29, 21].

A  drawback  of  MP  applied  to  sparse  representation  is  its  greediness.  It  is  possible  to  construct 
signal representation problems where, because of its greediness, MP (or OMP) initially selects an atom that is not part of the optimal sparse representation; as a result, many of the subsequent atoms selected by MP simply compensate for the poor initial selection [33, 21]. This shortcoming motivated the development of BP, which succeeds on these problems [21]; recent theoretical work
explains  this  phenomenon  [37,  36,  51]. 

These problems are also motivation for the development of GBP. Here MP fails because of its poor initial selection of atoms; however, the atoms initially selected by MP are not necessarily bad in general; after all, these problems are specially designed for MP to fail on. For MP to succeed on
these  problems,  it  would  need  to  either  make  ‘better’  atom  selections  or  be  able  to  discard  ‘bad’ 
atoms  to  recover  from  poor  selections  (or  both).  GBP  adapts  the  greedy  strategy  to  incorporate 
both  of  these  ideas  and  compute  the  same  representations  as  BP. 

2.2  Basis  Pursuit 

Basis Pursuit (BP) [19, 20, 21] approaches sparse signal representation by changing the problem to one of minimizing the $\ell^1$-norm of the representation coefficients. This can be interpreted as assuming a ‘sparse prior’ on the representation coefficients [80]. The $\ell^1$-norm in particular implies that the resulting representations are sparse in the $\ell^0$-norm sense under certain conditions [37, 36, 51] and algorithmically equates sparse signal representation with linear programming.

A linear program is defined as follows: Given a matrix $A \in \mathbb{R}^{m \times n}$, a (column) vector $b \in \mathbb{R}^m$, and a (column) vector $c \in \mathbb{R}^n$, compute a (column) vector $x \in \mathbb{R}^n$ satisfying

$$\text{Minimize } c^T x \text{ subject to } Ax = b,\ x_i \ge 0 \tag{4}$$

The signal representation problem is posed in BP as a linear program with the following assignments (the variables on the right hand side are as defined in Section 1.1 and the variables on the left hand side plug into the linear program above):

$$\begin{aligned}
A &\leftarrow [\,\psi_1\ \psi_2\ \cdots\ \psi_n\ \ {-\psi_1}\ {-\psi_2}\ \cdots\ {-\psi_n}\,] \\
b &\leftarrow x \\
c &\leftarrow [\,1\ 1\ \cdots\ 1\,]^T \\
x &\leftarrow [\,\alpha_1\ \alpha_2\ \cdots\ \alpha_n\,]^T
\end{aligned}$$

Minimizing $c^T x$ is equivalent to minimizing the $\ell^1$-norm of the coefficients. Note that $A$, corresponding to the dictionary, is doubled to include the negative of each atom; this is due to the linear programming constraint that the coefficients be nonnegative (4).
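The assignments above translate directly into a call to a general-purpose LP solver. The sketch below assumes SciPy's `linprog` with the HiGHS backend is available; it is a generic illustration and not the BP-Simplex or BP-Interior code of [20, 21]. The function name `basis_pursuit` and the tiny dictionary are hypothetical. The solver returns one nonnegative coefficient per column of the doubled matrix, and the signed coefficients are recovered as the difference of the two halves.

```python
import numpy as np
from scipy.optimize import linprog

def basis_pursuit(D, x):
    """Minimum l1-norm coefficients alpha with D @ alpha = x, via an LP."""
    n = D.shape[1]
    A = np.hstack([D, -D])                   # doubled dictionary
    c = np.ones(2 * n)                       # cost = sum of nonnegative parts
    res = linprog(c, A_eq=A, b_eq=x, bounds=(0, None), method="highs")
    if not res.success:
        raise RuntimeError(res.message)
    return res.x[:n] - res.x[n:]             # signed coefficients

# Two axis-aligned atoms plus one diagonal atom in R^2.
s = 1.0 / np.sqrt(2.0)
D = np.array([[1.0, 0.0, s],
              [0.0, 1.0, s]])
x = np.array([1.0, 1.0])

alpha = basis_pursuit(D, x)
```

Here the diagonal atom alone represents the signal with $\ell^1$-norm $\sqrt{2}$, whereas the two axis-aligned atoms would need $\ell^1$-norm 2, so the LP selects the single diagonal atom.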

Chen  et  al.  [20,  21]  describe  two  algorithms  for  BP,  BP-Simplex  and  BP-Interior,  which  are 
the  well-known  simplex  and  interior  point  methods  of  linear  programming  [104]  applied  to  signal 
representation.  The  choice  of  which  BP  algorithm  to  use  depends  on  the  structure  of  the  dictionary: 
for  dictionaries  that  have  fast  transforms,  BP-Interior  exploits  these  transforms  in  the  solution  of  the 
corresponding  linear  program.  However,  the  running  time  of  linear  programming  is  still  typically 
an  order  of  magnitude  slower  than  that  of  MP  on  realistic  problems  [21]. 

While  standard  linear  programming  methods  have  been  highly  tuned  over  time,  they  are  not 
necessarily  ideally  suited  to  the  specific  problem  of  computing  signal  representations.  For  example, 
many linear programming methods assume that the matrix $A$ is sparse, as is the case for constraints that arise in typical operations research problems, while this may not be the case in signal representation problems. This raises the possibility that alternative approaches could prove more efficient for the particular problem of signal representation. Some inspiration for an alternative approach is provided by Chen et al. [21], who contrast MP and BP-Simplex, characterizing MP
as  a  ‘build-up’  approach  and  BP-Simplex  as  a  ‘swap-down’  approach.  If  A  is  not  sparse,  then 
the  swaps  (or  pivots)  executed  by  BP-Simplex  can  be  costly,  in  the  computation  of  an  individual 
swap,  in  the  number  of  swaps,  and  in  the  computation  of  an  initial  basis.  GBP  instead  takes  the 
‘build-up’  approach  to  solving  linear  programming. 

2.3  Subset  selection 

Sparse  signal  representation  is  closely  related  to  the  problem  of  subset  selection  for  regression, 
i.e.,  determining  the  optimal  subset  of  variables  on  which  to  regress  a  data  set  [74].  In  sparse 
signal  representation,  the  signal  corresponds  to  the  data  set,  while  the  atoms  correspond  to  the 
variables.  In  fact,  MP  was  inspired  by  Projection  Pursuit  [50,  61],  in  particular  its  use  as  a  regression 
algorithm  [49].  Given  this  connection,  it  should  not  be  surprising  that  some  algorithmic  ideas  in 
sparse  signal  representation  correspond  to  earlier  work  in  regression.  For  example,  in  Forward 
Selection  the  optimal  subset  is  constructed  by  starting  with  the  empty  subset  and  iteratively  adding 
variables  to  it,  selecting  at  each  iteration  the  variable  that  accounts  for  most  of  the  residual  variance; 
this  is  essentially  what  OMP  does.  Backward  Elimination,  which  starts  with  the  full  set  of  variables 
and  iteratively  pares  it  down,  has  similarly  been  adapted  for  signal  representation  [59,  25]. 

One standard algorithm for subset selection in regression which appears to have no analogue in sparse signal representation is Efroymson’s algorithm [41], also called step-wise regression, which proceeds like Forward Selection, but, like Backward Elimination, drops variables from the subset as they
become  irrelevant.  GBP  follows  a  similar  strategy,  iteratively  selecting  atoms  and  occasionally 
discarding  them. 

2.4  Linear  programming 

While Basis Pursuit represents the first formal casting of signal representation as linear programming, linear programming has long been used in sparse signal representation, particularly for deconvolution in various applications [40, 8, 79]. It is therefore not surprising that developments in
sparse  signal  representation  closely  parallel  earlier  developments  in  linear  programming. 

Examining the literature in linear programming reveals that MP and OMP have linear programming analogues: MP is technically equivalent to one of the earliest (1948) methods developed
for  linear  programming,  called  von  Neumann’s  algorithm  [26].  Similarly,  OMP  is  equivalent  to  a 
phase  I  algorithm  [67]  for  the  simplex  method. 

GBP  builds  up  a  solution  to  a  linear  programming  problem;  several  linear  programming  methods 
adopt  a  similar  strategy,  solving  increasingly  complex  problems  as  constraints  or  variables  are 
iteratively  introduced  [98,  92,  81];  see  also  [60].  We  remark  that  one  method,  an  interior  point 
method  called  the  gravitational  method  [77,  18],  can  be  shown  to  be  equivalent  to  GBP  when 
applied  to  the  problem  dual  to  (4).  Empirically,  the  gravitational  method  is  faster  than  standard 
methods  on  some  problems  [18],  which  is  consistent  with  our  results. 


3  The  Geometry  of  Basis  Pursuit 


GBP is based on computational geometry, specifically on the following geometric interpretation of BP. Given a signal $x$ and a dictionary $\mathcal{D}$, let $\mathrm{conv}(\mathcal{D})$ denote the convex hull of $\mathcal{D}$; the vertices of the facet of $\mathrm{conv}(\mathcal{D})$ intersected by the vector $x$ are the atoms in the minimum $\ell^1$-norm representation of $x$.

To see this, treat the signal as a vector and the atoms as points in $\mathbb{R}^d$. First consider the set of signals that have representations $\alpha$ such that $\|\alpha\|_1 = 1$. By definition, this is the convex hull of the dictionary

$$\mathrm{conv}(\mathcal{D}) = \Big\{\, x \;\Big|\; x = \sum_{i \in I} \alpha_i \psi_i \ \text{and}\ \sum_{i \in I} \alpha_i = 1,\ \alpha_i \ge 0 \,\Big\}$$

Note that because $\|\psi_i\|_2 = 1$, $\mathrm{conv}(\mathcal{D})$ is a polytope inscribed in the unit sphere. Let $x_F$ be the point of intersection between the vector $x$ and the boundary of $\mathrm{conv}(\mathcal{D})$. $x_F$ lies on the boundary of $\mathrm{conv}(\mathcal{D})$ and can be represented as a linear combination of the vertices of the facet of $\mathrm{conv}(\mathcal{D})$ containing $x_F$; call this facet $F_x$. This representation is the minimum $\ell^1$-norm representation: its $\ell^1$-norm is 1, and it is impossible to construct a representation with $\ell^1$-norm less than 1. The minimum $\ell^1$-norm representation of $x$ is simply a scaling of the minimum $\ell^1$-norm representation of $x_F$, and the atoms in the representation are the same. See Figure 1. (Note that if we know the atoms in a representation of $x$ it is straightforward to calculate the corresponding coefficients.)

Figure 1: A geometric interpretation of Basis Pursuit. The signal vector $x$ intersects the facet $F_x$ of the convex hull of the dictionary, shown in gray. The vertices of $F_x$, $\psi_1$ and $\psi_2$, are the atoms in the Basis Pursuit representation of $x$.

Thus BP is equivalent to finding the facet of $\mathrm{conv}(\mathcal{D})$ which intersects $x$. Computing this
intersection  is  known  to  reduce  to  linear  programming  [90];  to  our  knowledge,  the  converse  is 
known  [15]  but  never  utilized  to  solve  linear  programming.  We  use  this  equivalence  to  drive  GBP. 
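In two dimensions the facet picture can be checked by brute force, since a facet is an edge between two atoms. The sketch below is illustrative only: the six-atom dictionary and the function name `min_l1_facet_2d` are made up, and exhaustive search over pairs is not how GBP finds the facet. It relies on the fact that, in $\mathbb{R}^2$, a minimum $\ell^1$-norm representation using at most two atoms always exists.

```python
import numpy as np
from itertools import combinations

def min_l1_facet_2d(atoms, x):
    """Return (i, j, c): the pair of atoms enclosing x with the smallest sum(c).

    atoms: (n, 2) unit vectors, closed under negation; x: (2,) signal.
    """
    best = None
    for i, j in combinations(range(len(atoms)), 2):
        B = np.column_stack([atoms[i], atoms[j]])
        if abs(np.linalg.det(B)) < 1e-12:
            continue                          # antipodal atoms span no edge
        c = np.linalg.solve(B, x)             # x = c[0]*atoms[i] + c[1]*atoms[j]
        if np.all(c >= -1e-12) and (best is None or c.sum() < best[2].sum()):
            best = (i, j, c)
    return best

angles = np.deg2rad([0, 45, 90, 180, 225, 270])
atoms = np.column_stack([np.cos(angles), np.sin(angles)])
x = np.array([2.0, 1.0])

i, j, c = min_l1_facet_2d(atoms, x)
x_F = x / c.sum()                             # intersection of x with the hull
weights = c / c.sum()                         # convex weights of x_F on the edge
```

Dividing the coefficients by their sum gives convex weights for the boundary point `x_F`, which is the scaling relation between the two representations described in the text.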

A previous geometric interpretation of sparse representation [14] recognizes that in two dimensions BP computes representations with atoms that ‘enclose’ $x$. The interpretation provided here can be viewed as the generalization of this notion to higher dimensions.

4  The  Greedy  Basis  Pursuit  Algorithm 

Given  the  equivalence  between  BP  and  finding  the  facet  of  the  convex  hull  of  the  dictionary  that 
intersects the signal vector, we propose Greedy Basis Pursuit (GBP). GBP computes the minimum $\ell^1$-norm representation by searching for this facet directly.

The main idea behind GBP is to find the facet of interest by iteratively ‘pushing’ a hyperplane onto the surface of the convex hull of the dictionary until it coincides with the supporting hyperplane containing the facet. This approach is inspired by gift-wrapping methods [17, 64, 99] for the convex hull problem in computational geometry [91]. To adapt gift-wrapping to the problem of finding
a  particular  facet,  we  need  to  specify  how  the  initial  hyperplane  is  chosen  and  the  direction  in 
which  the  ‘wrapping’  proceeds  at  each  iteration.  Below  we  describe  the  GBP  algorithm,  prove  its 
convergence,  and  discuss  implementation  issues. 

4.1  The  main  algorithm 

GBP takes as input a signal $x \in \mathbb{R}^d$ and an overcomplete dictionary $\mathcal{D} = \{\psi_i\}_{i=1}^{n}$, where $n > d$ and, $\forall i$, $\psi_i \in \mathbb{R}^d$ and $\|\psi_i\|_2 = 1$, and outputs a representation of $x$ as a set of indices $I \subset \{1, \dots, n\}$ and a corresponding set of coefficients $A = \{\alpha_i\}_{i \in I}$ such that $x = \sum_{i \in I} \alpha_i \psi_i$. Note we assume that if $\psi_i \in \mathcal{D}$ then $-\psi_i \in \mathcal{D}$; see Section 2.2.

GBP greedily searches for the facet of $\mathrm{conv}(\mathcal{D})$ that intersects $x$, call it $F_x$. GBP proceeds by iteratively constructing a sequence of hyperplanes, $H^{(0)}, H^{(1)}, H^{(2)}, \dots$, supporting $\mathrm{conv}(\mathcal{D})$. (We use the superscript $(t)$ to denote iteration $t$.) At each iteration, GBP maintains a set of indices $I^{(t)}$ and a set of coefficients $A^{(t)}$, defining an approximation to $x$: $x^{(t)} = \sum_{i \in I^{(t)}} \alpha_i \psi_i$, and a normal vector $n^{(t)}$. The current hyperplane $H^{(t)}$ is defined to have normal $n^{(t)}$ and contain the set $\{\psi_i\}_{i \in I^{(t)}}$. Each consecutive hyperplane $H^{(t+1)}$ is a rotation of the current hyperplane $H^{(t)}$ determined by $x^{(t)}$. GBP stops when $H^{(t)}$ contains $F_x$ (and therefore $x^{(t)} = x$).

4.1.1  Initialization 

As we do not a priori know the orientation of $F_x$, we optimistically choose the initial supporting hyperplane $H^{(0)}$ to have normal $n^{(0)} = x / \|x\|_2$. In general $H^{(0)}$ will intersect only one vertex of $\mathrm{conv}(\mathcal{D})$, in particular the atom $\psi_{i_0}$, where $i_0 = \arg\max_i \langle \psi_i, n^{(0)} \rangle$. To see this, consider a hyperplane with normal $n^{(0)}$ at some distance greater than 1 away from the origin; if we move this hyperplane in the negative normal direction (towards the origin), the first point of $\mathrm{conv}(\mathcal{D})$ it will intersect is $\psi_{i_0}$. (Note that this is also the first atom selected by MP and OMP.) Thus $I^{(0)} = \{i_0\}$; this gives us $\alpha_{i_0} = \langle \psi_{i_0}, x \rangle$, $\Psi^{(0)} = \{\psi_{i_0}\}$, and $x^{(0)} = \alpha_{i_0} \psi_{i_0}$. For convenience, we denote the set of currently selected atoms by $\Psi^{(t)} = \{\psi_i\}_{i \in I^{(t)}}$.
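The initialization step can be sketched as follows. The function name `gbp_initialize` and the six-atom dictionary are hypothetical, and the code assumes the first approximation is the projection of the signal onto the selected atom, as in MP.

```python
import numpy as np

def gbp_initialize(D, x):
    """First hyperplane normal, first atom index, coefficient, approximation.

    D: (d, n) matrix with unit-norm columns, closed under negation; x: (d,).
    """
    n0 = x / np.linalg.norm(x)              # normal of the initial hyperplane
    i0 = int(np.argmax(D.T @ n0))           # atom furthest along the normal
    alpha0 = float(D[:, i0] @ x)            # projection coefficient on that atom
    x0 = alpha0 * D[:, i0]                  # assumed first approximation
    return n0, i0, alpha0, x0

angles = np.deg2rad([0, 45, 90, 180, 225, 270])
D = np.vstack([np.cos(angles), np.sin(angles)])   # atoms as columns
x = np.array([2.0, 1.0])

n0, i0, alpha0, x0 = gbp_initialize(D, x)
```

In this example the 45-degree atom is selected, and the residual `x - x0` is orthogonal to it.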


4.1.2  Iteration 


Each consecutive hyperplane $H^{(t+1)}$ is constructed by rotating $H^{(t)}$ in a 2-dimensional plane around a pivot point until another vertex of $\mathrm{conv}(\mathcal{D})$ is intersected. The plane of rotation and the pivot point are defined in terms of $x^{(t)}$. We define $x^{(t)}$ to be the best current approximation to $x$ using $\Psi^{(t)}$ and positive coefficients, that is, $x^{(t)} = \sum_{i \in I^{(t)}} \alpha_i \psi_i$ where $\alpha_i > 0$ and $\|x^{(t)} - x\|_2$ is minimized. Note that $x^{(t)}$ is the projection of $x$ onto the convex cone spanned by $\Psi^{(t)}$ with the origin at the apex; we provide details on computing $x^{(t)}$ in Section 4.1.3. Let $x_H^{(t)}$ be the intersection of the vector $x^{(t)}$ with $H^{(t)}$. If $d_H^{(t)}$ is the (orthogonal) distance from the hyperplane to the origin, i.e., $d_H^{(t)} = \langle \psi_i, n^{(t)} \rangle,\ \forall i \in I^{(t)}$, then

$$x_H^{(t)} = \frac{d_H^{(t)}}{\langle x^{(t)}, n^{(t)} \rangle}\; x^{(t)} \tag{5}$$


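Let $r^{(t)}$ denote the residual vector, $r^{(t)} = x - x^{(t)}$. Define $v^{(t)}$ to be the unit vector in the direction
of $r^{(t)}$ projected onto $H^{(t)}$,

$$v^{(t)} = \frac{r^{(t)} - \langle r^{(t)}, n^{(t)} \rangle n^{(t)}}{\| r^{(t)} - \langle r^{(t)}, n^{(t)} \rangle n^{(t)} \|}$$

The plane of rotation is the 2-dimensional plane defined by the point $x_H^{(t)}$ and the vectors $n^{(t)}$ and
$v^{(t)}$. The pivot point around which $H^{(t)}$ is rotated is $x_H^{(t)}$.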
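
As a hedged illustration of the pivot point in equation (5) and of the projected residual direction, the following sketch computes both for a small 2-dimensional example. The function name `pivot_and_direction` and the numerical values are assumptions made for this example only.

```python
import math


def dot(u, v):
    """Euclidean inner product of two equal-length sequences."""
    return sum(a * b for a, b in zip(u, v))


def pivot_and_direction(x, x_t, n, psi_i):
    """Return (x_H, r, v) for the current hyperplane.

    x     : signal
    x_t   : current approximation x^(t)
    n     : unit normal n^(t)
    psi_i : any currently selected atom (it lies on the hyperplane)
    """
    d_h = dot(psi_i, n)                                # distance from hyperplane to origin
    scale = d_h / dot(x_t, n)
    x_h = [scale * c for c in x_t]                     # equation (5)
    r = [a - b for a, b in zip(x, x_t)]                # residual r = x - x^(t)
    rn = dot(r, n)
    p = [rc - rn * nc for rc, nc in zip(r, n)]         # residual projected onto H
    p_norm = math.sqrt(dot(p, p))                      # zero only if r is parallel to n
    v = [c / p_norm for c in p]
    return x_h, r, v


x = [3.0, 1.0]
n = [c / math.sqrt(10.0) for c in x]                   # n = x / ||x||_2
psi = [1.0, 0.0]                                       # the single selected atom
x_t = [3.0, 0.0]                                       # x^(0) = <psi, x> psi
x_h, r, v = pivot_and_direction(x, x_t, n, psi)
```

With a single selected atom the pivot coincides with that atom, and the resulting direction is a unit vector orthogonal to the normal.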

To compute the first vertex which the hyperplane intersects under this rotation, we order the
atoms by the angle $\theta$ they form with $v^{(t)}$, where $\theta$ is given by

$$\theta_i = \arctan\left( \frac{\langle \psi_i - x_H^{(t)}, n^{(t)} \rangle}{\langle \psi_i - x_H^{(t)}, v^{(t)} \rangle} \right)$$

The atom selected is then $\psi_k$ where

$$k = \arg\min_i \theta_i \qquad (7)$$

Once selected, the atom $\psi_k$ is added to the set $\Psi^{(t)}$ and a new approximation to $x$ is computed,
$x^{(t+1)}$. In this new approximation, some atoms in $\Psi^{(t)} \cup \{\psi_k\}$ may become extraneous: they are
discarded to form $\Psi^{(t+1)}$; see Section 4.1.3 below.

Figure 2: A schematic of the first iteration of GBP. The initial hyperplane $H^{(0)}$ has normal $n^{(0)}$ in
the direction of the signal $x$ and contains $\psi_{i_0}$. The atoms are projected from $\mathbb{R}^d$ to the $n^{(0)}$-$v^{(0)}$
plane (shown) and sorted by $\theta$. The second atom selected is $\psi_k$, corresponding to a rotation of
$H^{(0)}$ around $x_H^{(0)}$ to $H^{(1)}$. Note that vW is orthogonal to the $n^{(0)}$-$v^{(0)}$ plane (and therefore is not
shown).

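The new hyperplane can now be computed; it has normal

$$n^{(t+1)} = \frac{ -\langle \psi_k - x_H^{(t)}, n^{(t)} \rangle v^{(t)} + \langle \psi_k - x_H^{(t)}, v^{(t)} \rangle n^{(t)} }{ \| -\langle \psi_k - x_H^{(t)}, n^{(t)} \rangle v^{(t)} + \langle \psi_k - x_H^{(t)}, v^{(t)} \rangle n^{(t)} \| }$$

and contains $x_H^{(t+1)}$.

The procedure is repeated until $x^{(t)} = x$, that is, $H^{(t)}$ contains $F_x$.

Figure 3 provides a visualization of GBP in action in three dimensions.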
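
The formula for the rotated normal can be checked numerically. The sketch below (the function name `rotate_normal` and the 2-dimensional values are illustrative assumptions) builds the new normal from a selected atom and confirms that the rotated hyperplane through the pivot contains that atom.

```python
import math


def dot(u, v):
    """Euclidean inner product of two equal-length sequences."""
    return sum(a * b for a, b in zip(u, v))


def rotate_normal(psi_k, x_h, n, v):
    """Normal of the hyperplane obtained by rotating about x_h until it meets psi_k."""
    diff = [a - b for a, b in zip(psi_k, x_h)]
    a = dot(diff, n)                                   # <psi_k - x_H, n>
    b = dot(diff, v)                                   # <psi_k - x_H, v>
    w = [-a * vc + b * nc for vc, nc in zip(v, n)]     # unnormalized new normal
    w_norm = math.sqrt(dot(w, w))
    return [c / w_norm for c in w]


s = math.sqrt(10.0)
n = [3.0 / s, 1.0 / s]                                 # current unit normal
v = [-1.0 / s, 3.0 / s]                                # unit vector orthogonal to n
x_h = [1.0, 0.0]                                       # pivot point
psi_k = [0.0, 1.0]                                     # newly selected atom
n_new = rotate_normal(psi_k, x_h, n, v)
```

Here the new normal points along the diagonal, so the rotated hyperplane passes through both the pivot and the new atom.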

4.1.3  Computational  details 

At each iteration we compute $x^{(t)}$ as the projection of $x$ onto the convex cone of $\Psi^{(t)}$.

One approach to computing this projection is to maintain an orthogonal basis for the span of
$\Psi^{(t)}$, updating it as atoms are added to $\Psi^{(t)}$, as in OMP [83]; this is impractical in our case as most
iterative orthogonalization procedures are order-dependent and hyperplane rotation may cause us
to discard arbitrary atoms from $\Psi^{(t)}$.

Instead we maintain a biorthogonal system consisting of $\Psi^{(t)}$ and the set of vectors $\Psi^{\perp(t)}$

Figure 3: GBP in action on a 3-dimensional problem. Each row depicts one iteration, the left
column from a fixed viewpoint, the right column projected to the $n^{(t)}$-$v^{(t)}$ plane. The signal vector
is green, the unselected atoms blue circles, the selected atoms red discs, the convex cone of $\Psi^{(t)}$ is
gray, the normal is the solid black line, and two vectors in $H^{(t)}$ are the dashed lines.
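
Algorithm 1 Greedy Basis Pursuit

Input

• A signal $x \in \mathbb{R}^d$

• A dictionary $\mathcal{D} = \{\psi_i\}_{i=1}^{n}$

• A threshold $\epsilon > 0$

Output

A representation of $x$, consisting of

• A set of indices $\mathcal{I}$

• A set of coefficients $\mathcal{A} = \{\alpha_i\}_{i \in \mathcal{I}}$

such that $\|x - \sum_{i \in \mathcal{I}} \alpha_i \psi_i\| < \epsilon$

Procedure

1. Initialize

(a) Select the first atom
$k \leftarrow \arg\max_{i \in \{1,\ldots,n\}} \langle x, \psi_i \rangle$

(b) Compute the initial approximation
$\alpha_k \leftarrow \langle x, \psi_k \rangle$, $\mathcal{I}^{(0)} \leftarrow \{k\}$, $\mathcal{A}^{(0)} \leftarrow \{\alpha_k\}$

(c) Initialize the biorthogonal system

(d) Initialize the hyperplane
$\tilde{x} \leftarrow \alpha_k \psi_k$, $n \leftarrow x/\|x\|$, $r \leftarrow x - \tilde{x}$

2. Repeat until $\|r\| < \epsilon$

(a) Compute the center and plane of rotation
$x_H \leftarrow (\langle \psi_i, n \rangle / \langle \tilde{x}, n \rangle)\, \tilde{x}$, for any $i \in \mathcal{I}$
$v \leftarrow (r - \langle r, n \rangle n) / \|r - \langle r, n \rangle n\|$

(b) Project atoms into the $n$-$v$-plane and select the next atom
$k \leftarrow \arg\min_{i \in \{1,\ldots,n\}} \tan^{-1}\left( \langle \psi_i - x_H, n \rangle / \langle \psi_i - x_H, v \rangle \right)$

(c) Compute the new representation and update the biorthogonal system
$\{\mathcal{I}, \mathcal{A}, \Psi^{\perp}\} \leftarrow$ AddAtom($x$, $\mathcal{I}$, $\mathcal{A}$, $\psi_k$, $\Psi^{\perp}$)

(d) Discard any extraneous atoms
while $\exists\, \alpha_j < 0$, $j \in \mathcal{I}$ do
$\{\mathcal{I}, \mathcal{A}, \Psi^{\perp}\} \leftarrow$ SubtractAtom($x$, $\mathcal{I}$, $\mathcal{A}$, $\psi_j$, $\Psi^{\perp}$)

(e) Update the hyperplane parameters
$\tilde{x} \leftarrow \sum_{i \in \mathcal{I}} \alpha_i \psi_i$
$n \leftarrow \dfrac{ -\langle \psi_k - x_H, n \rangle v + \langle \psi_k - x_H, v \rangle n }{ \| -\langle \psi_k - x_H, n \rangle v + \langle \psi_k - x_H, v \rangle n \| }$
$r \leftarrow x - \tilde{x}$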


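biorthogonal to $\Psi^{(t)}$. Each element $\psi_i^{\perp}$ of $\Psi^{\perp(t)}$ satisfies the following two equations:

$$\langle \psi_i^{\perp}, \psi_i \rangle = 1 \qquad (9)$$

$$\langle \psi_i^{\perp}, \psi_j \rangle = 0, \quad \text{if } j \neq i \qquad (10)$$

The biorthogonal vector $\psi_i^{\perp}$ can be understood as the component of $\psi_i$ that is orthogonal to all
of the other vectors in $\Psi^{(t)}$, appropriately scaled. That is, if we express an atom $\psi_i \in \Psi^{(t)}$ as

$$\psi_i = \psi_i^{\parallel} + \psi_i^{\vdash} \qquad (11)$$

where $\psi_i^{\parallel}$ is the component of $\psi_i$ lying in the span of $\Psi^{(t)} - \{\psi_i\}$,

$$\psi_i^{\parallel} = \sum_{j \in \mathcal{I}^{(t)},\, j \neq i} c_{ij}\, \psi_j \qquad (12)$$

for some coefficients $c_{ij}$, and $\psi_i^{\vdash}$ is orthogonal to the span of $\Psi^{(t)} - \{\psi_i\}$, then the biorthogonal vector to $\psi_i$ is given by

$$\psi_i^{\perp} = \frac{\psi_i^{\vdash}}{\|\psi_i^{\vdash}\|_2^2} \qquad (13)$$

Given the biorthogonal system, we can compute the current approximation to $x$ as

$$x^{(t)} = \sum_{i \in \mathcal{I}^{(t)}} \alpha_i^{(t)} \psi_i, \quad \text{where } \alpha_i^{(t)} = \langle x, \psi_i^{\perp} \rangle \qquad (14)$$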

The sign of the coefficients indicates whether or not an approximation lies in the convex cone
of the atoms: if $\alpha_i < 0$ for some $i$ then the approximation does not lie in the convex cone; the
corresponding atom $\psi_i$ is deleted from the representation and the biorthogonal system is updated.
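
A small worked example may help make the biorthogonal vectors and the sign test concrete. The sketch below handles only the two-atom case, where the span of the other atoms is a single line; the function name `biorthogonal_pair` and the numbers are assumptions chosen for illustration.

```python
def dot(u, v):
    """Euclidean inner product of two equal-length sequences."""
    return sum(a * b for a, b in zip(u, v))


def biorthogonal_pair(p1, p2):
    """Biorthogonal vectors for two linearly independent atoms (two-atom case only)."""
    def dual(p, other):
        c = dot(p, other) / dot(other, other)
        orth = [a - c * b for a, b in zip(p, other)]   # component orthogonal to `other`
        s = dot(orth, orth)
        return [a / s for a in orth]                   # scale so that <dual, p> = 1
    return dual(p1, p2), dual(p2, p1)


psi1, psi2 = [1.0, 0.0], [0.6, 0.8]                    # two unit-norm atoms
d1, d2 = biorthogonal_pair(psi1, psi2)
x = [1.0, -0.4]
alpha1, alpha2 = dot(x, d1), dot(x, d2)                # coefficients <x, psi_i^perp>
outside_cone = alpha1 < 0 or alpha2 < 0                # a negative coefficient flags an extraneous atom
```

In this example the second coefficient is negative, so the signal lies outside the convex cone of the two atoms and the second atom would be discarded.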

The biorthogonal system and the representation can be updated as atoms are added to and subtracted from
$\Psi^{(t)}$. Such adaptive biorthogonalization methods have recently been applied to MP [88, 7] and are
standard in linear programming ([104], Chapter 8). We present pseudocode for adding an atom in
Algorithm 2 and pseudocode for subtracting an atom in Algorithm 3.
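
The following Python sketch shows one way such rank-one updates can be written so that conditions (9) and (10) continue to hold after each change. It is an illustration derived from those two conditions, not a transcription of the authors' code; the names `add_atom` and `subtract_atom`, the list-based data layout, and the example values are assumptions.

```python
def dot(u, v):
    """Euclidean inner product of two equal-length sequences."""
    return sum(a * b for a, b in zip(u, v))


def add_atom(x, atoms, duals, alphas, psi_k):
    """Add psi_k, returning updated atoms, biorthogonal vectors and coefficients."""
    betas = [dot(d, psi_k) for d in duals]             # beta_i = <psi_i^perp, psi_k>
    orth = list(psi_k)
    for beta, atom in zip(betas, atoms):               # remove the part in span(atoms)
        orth = [o - beta * a for o, a in zip(orth, atom)]
    s = dot(orth, orth)
    dual_k = [o / s for o in orth]                     # scaled so <dual_k, psi_k> = 1
    new_duals = [[dc - beta * kc for dc, kc in zip(d, dual_k)]
                 for d, beta in zip(duals, betas)]
    alpha_k = dot(x, dual_k)
    new_alphas = [a - beta * alpha_k for a, beta in zip(alphas, betas)]
    return atoms + [psi_k], new_duals + [dual_k], new_alphas + [alpha_k]


def subtract_atom(atoms, duals, alphas, k):
    """Remove the atom at position k and restore biorthogonality of the rest."""
    dual_k, alpha_k = duals[k], alphas[k]
    s = dot(dual_k, dual_k)
    out_atoms, out_duals, out_alphas = [], [], []
    for i, (atom, d, a) in enumerate(zip(atoms, duals, alphas)):
        if i == k:
            continue
        gamma = dot(d, dual_k) / s                     # component of d along dual_k
        out_atoms.append(atom)
        out_duals.append([dc - gamma * kc for dc, kc in zip(d, dual_k)])
        out_alphas.append(a - gamma * alpha_k)
    return out_atoms, out_duals, out_alphas


x = [1.0, 0.5, 0.2]
atoms, duals, alphas = [[1.0, 0.0, 0.0]], [[1.0, 0.0, 0.0]], [1.0]
atoms, duals, alphas = add_atom(x, atoms, duals, alphas, [0.6, 0.8, 0.0])
duals_after_add, alphas_after_add = [list(d) for d in duals], list(alphas)
atoms, duals, alphas = subtract_atom(atoms, duals, alphas, 1)
```

Adding and then removing the same atom returns the system to its starting state, which is a convenient sanity check for any implementation of these updates.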


4.2  Analysis 

By construction, GBP computes the minimum $\ell^1$-norm representation of a given signal. To prove
this  we  show  that  GBP  converges  to  an  exact  representation  in  a  finite  number  of  steps  and  that 
the  representation  corresponds  to  a  facet  of  the  convex  hull  of  the  dictionary. 

First,  we  prove  that  GBP  converges  to  an  exact  representation.  At  each  iteration  of  GBP  there 
is  a  decrease  in  approximation  error,  as  stated  in  the  following  theorem. 
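
Algorithm 2 AddAtom

Input

• The signal $x$

• The dictionary $\mathcal{D}$

• The current representation $\mathcal{I}$, $\mathcal{A}$

• The atom to add $k$, $\psi_k$

• The current biorthogonal vectors $\Psi^{\perp}$

Output

• The updated representation $\mathcal{I}$, $\mathcal{A}$

• The updated biorthogonal vectors $\Psi^{\perp}$

Procedure

1. Compute the new biorthogonal vector
$\forall i \in \mathcal{I}$, $\beta_i \leftarrow \langle \psi_i^{\perp}, \psi_k \rangle$
$\psi_k^{\perp} \leftarrow \dfrac{\psi_k - \sum_{i \in \mathcal{I}} \beta_i \psi_i}{\|\psi_k - \sum_{i \in \mathcal{I}} \beta_i \psi_i\|_2^2}$

2. Update the biorthogonal system
$\forall i \in \mathcal{I}$, $\psi_i^{\perp} \leftarrow \psi_i^{\perp} - \beta_i \psi_k^{\perp}$

3. Update the representation
$\alpha_k \leftarrow \langle x, \psi_k^{\perp} \rangle$
$\forall i \in \mathcal{I}$, $\alpha_i \leftarrow \alpha_i - \beta_i \alpha_k$
$\mathcal{I} \leftarrow \mathcal{I} \cup \{k\}$
$\mathcal{A} \leftarrow \mathcal{A} \cup \{\alpha_k\}$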


Theorem 1. Given a signal $x \in \mathbb{R}^d$ and a dictionary $\mathcal{D} = \{\psi_i\}_{i=1}^{n}$, where $n > 2d$, $\psi_i \in \mathcal{D}$, $\psi_i \in \mathbb{R}^d$
and $\|\psi_i\|_2 = 1$; if $\psi_i \in \mathcal{D}$ then $-\psi_i \in \mathcal{D}$; and the atoms are in general position, if GBP is run with
$\mathcal{D}$ and $x$ as input and if $x^{(t)} \neq 0$, at iteration $t + 1$ of GBP, $0 < \|x - x^{(t+1)}\|_2 < \|x - x^{(t)}\|_2$.

Proof. At iteration $t$, let $S$ be the hypersphere centered at $x$ with radius $\|x - x^{(t)}\|_2$, let $\psi_k$ be the
next atom selected by GBP, and let $T$ be the tangent plane to $S$ at $x^{(t)}$. $T$ contains the origin (if it
did not, then some scaling of $x^{(t)}$ would be a better approximation to $x$), and thus bisects the unit
sphere. Because the atoms are in general position, $n > 2d$, and $\psi_i \in \mathcal{D}$ if $-\psi_i \in \mathcal{D}$, if $|\Psi^{(t)}| < d$,
then there will be at least one atom in the same half-space of $T$ as $x$. (Note that if $|\Psi^{(t)}| = d$, we
are done, as we would also have $x^{(t)} = x$.)

$\psi_k$ lies in the same half-space of $T$ as $x$: by construction, there is no atom $\psi_o$ such that
$\langle \psi_o - x_H^{(t)}, n^{(t)} \rangle > 0$; by general position, there is no atom $\psi_o$ such that $\langle \psi_o - x_H^{(t)}, n^{(t)} \rangle = 0$ and
$\langle \psi_o - x_H^{(t)}, v^{(t)} \rangle > 0$; and, by the ordering of atoms by step 2(b) of GBP, GBP selects an atom in
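
Algorithm 3 SubtractAtom

Input

• The signal $x$

• The current representation $\mathcal{I}$, $\mathcal{A}$

• The index of the atom to subtract $k$

• The current biorthogonal vectors $\Psi^{\perp}$

Output

• The updated representation $\mathcal{I}$, $\mathcal{A}$

• The updated biorthogonal vectors $\Psi^{\perp}$

Procedure

1. Delete the atom from the representation
$\mathcal{I} \leftarrow \mathcal{I} - \{k\}$
$\mathcal{A} \leftarrow \mathcal{A} - \{\alpha_k\}$

2. Update the biorthogonal system
$\Psi^{\perp} \leftarrow \Psi^{\perp} - \{\psi_k^{\perp}\}$
$\forall i \in \mathcal{I}$, $\gamma_i \leftarrow \langle \psi_i^{\perp}, \psi_k^{\perp} \rangle / \|\psi_k^{\perp}\|_2^2$
$\forall i \in \mathcal{I}$, $\psi_i^{\perp} \leftarrow \psi_i^{\perp} - \gamma_i \psi_k^{\perp}$

3. Update the representation
$\forall i \in \mathcal{I}$, $\alpha_i \leftarrow \alpha_i - \gamma_i \alpha_k$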


the  same  half-space  of  T  as  x,  if  one  exists. 

Since $\psi_k$ lies in the same half-space as $x$, we can find a point $x^{(t)} + \epsilon(\psi_k - x^{(t)})$ that is interior to $S$ and
therefore closer to $x$ than $x^{(t)}$. Therefore $\|x - x^{(t+1)}\| < \|x - x^{(t)}\|$. □

Theorem 1 also implies that GBP does not cycle. GBP may select the same atom more than
once, that is, GBP may select an atom, discard it, and select it again (this behaviour depends on
the shape of the facets of conv($\mathcal{D}$)), but GBP will never revisit the same state. Because there are a
finite number of states and GBP improves at each iteration, GBP converges. By the same arguments
as Theorem 1, at convergence the final supporting hyperplane contains a facet of conv($\mathcal{D}$) and thus
GBP computes the minimum $\ell^1$-norm representation.

The duality of GBP to the gravitational method [77] of linear programming implies that the
computational complexity of GBP is exponential in the worst case [76]. Current results on the
simplex algorithm suggest that GBP is likely to be polynomial in the average [15] and smoothed [96]
cases.

4.3 Implementation Issues


We  briefly  describe  two  obstacles  that  any  implementation  of  GBP  may  encounter,  degeneracy  and 
numerical  instability,  and  our  approach  to  handling  them. 

4.3.1  Degeneracy 

Degeneracy occurs when the atoms of the dictionary are not in general position; the atoms are in
general position if every $k$-face of conv($\mathcal{D}$) contains exactly $k + 1$ atoms [107]. Degeneracy can
occur  if  the  dictionary  is  specially  designed,  for  example,  if  the  atoms  are  defined  to  be  the  vertices 
of  a  hypercube  inscribed  in  the  unit  hypersphere.  If  GBP  encounters  degeneracy,  the  updates 
described  in  Section  4.1.3  will  fail,  resulting  in  an  error.  Although  GBP  does  not  currently  include 
a  mechanism  to  detect  and  handle  degeneracy,  incorporating  such  a  feature  is  possible.  A  simple 
solution  is  to  perturb  the  atoms  of  the  dictionary  sufficiently  to  place  them  in  general  position;  see 
Section  5.1. 

4.3.2  Numerical  instability 

Numerical instability can occur in the biorthogonalization stage of GBP. Let $\Psi$ be a matrix
corresponding to $\Psi^{(t)}$ for some $t$, where each row of $\Psi$ is an atom in $\Psi^{(t)}$, and let $\Psi^{\perp}$ denote the
corresponding matrix of biorthogonal vectors. If at any iteration the matrix $\Psi$ is ill-conditioned,
the computation of the biorthogonal vectors we have described may be unstable (similar difficulties
arise in Gram-Schmidt orthogonalization [89, 13]). One workaround is to compute a full
biorthogonalization at each iteration, or at least whenever instability is detected. However, a full
biorthogonalization can be costly, as it is typically computed via the pseudoinverse [57]: since
$\Psi (\Psi^{\perp})^T = I$, where $I$ denotes the identity matrix, we can compute $\Psi^{\perp}$ as $(\Psi^{+})^T$, where $\cdot^{+}$
denotes the pseudoinverse.

We instead opt to compute the biorthogonalization using an iterative pseudoinverse technique [9].
This technique takes an initial estimate of the pseudoinverse and iteratively updates
it, converging to the true pseudoinverse. If the initial estimate is sufficiently close to the true
pseudoinverse, then the iterative pseudoinverse computation is substantially faster than the standard
pseudoinverse. This approach is well suited to GBP, as the adaptive biorthogonalization already
provides such an estimate.


The iterative pseudoinverse algorithm proceeds as follows. Given a matrix $\Psi$ and an initial
estimate of the pseudoinverse $\Psi^{+(0)}$, the updates to $\Psi^{+}$ are computed by

$$\Psi^{+(t+1)} = \Psi^{+(t)} \left( 2I - \Psi \Psi^{+(t)} \right)$$

(Note that here $t$ denotes the iteration of the pseudoinverse algorithm, not the iteration of GBP.)
For a detailed analysis of this algorithm, see [95]. While a classic technique, this algorithm is the
subject of ongoing research [82, 85].

Our implementation of GBP tests if $\Psi (\Psi^{\perp})^T = I$ within a specified level of tolerance after
each adaptive biorthogonalization. If the test fails, the iterative pseudoinverse algorithm is applied.
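
To illustrate the update and the tolerance test together, here is a minimal pure-Python sketch. The function names, the tolerance, the step limit, and the slightly inaccurate starting estimate are all assumptions made for this example; the authors' Matlab implementation may differ in every detail.

```python
def matmul(a, b):
    """Product of two matrices stored as lists of rows."""
    return [[sum(p * q for p, q in zip(row, col)) for col in zip(*b)] for row in a]


def refine_pseudoinverse(psi, est, tol=1e-12, max_steps=50):
    """Refine `est` (d x m) towards the pseudoinverse of `psi` (m x d, rows are atoms)."""
    m = len(psi)
    for _ in range(max_steps):
        prod = matmul(psi, est)                        # m x m, should approach I
        err = max(abs(prod[i][j] - (1.0 if i == j else 0.0))
                  for i in range(m) for j in range(m))
        if err < tol:                                  # biorthogonality test passed
            break
        corr = [[(2.0 if i == j else 0.0) - prod[i][j] for j in range(m)]
                for i in range(m)]
        est = matmul(est, corr)                        # est <- est (2I - psi est)
    return est


psi = [[1.0, 0.0, 0.0], [0.6, 0.8, 0.0]]               # two atoms as rows
est0 = [[1.0, 0.01], [-0.74, 1.24], [0.0, 0.0]]        # hypothetical drifted estimate
pinv = refine_pseudoinverse(psi, est0)
duals = [list(col) for col in zip(*pinv)]              # rows of the transposed pseudoinverse
```

Because the error matrix is squared at every step, a starting estimate that is already close needs only a handful of iterations, which is the situation the adaptive biorthogonalization is meant to provide.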


5  Results 

We now examine the performance of GBP. We compared the running time of GBP to that of standard
linear programming algorithms on three data sets (random data, speech data, and seismic data),
described  below.  We  also  provide  an  example  of  GBP’s  performance  on  a  single  signal  and  contrast 
it  with  that  of  Matching  Pursuit. 

In  each  experiment,  we  measured  the  running  times  of  GBP  and  standard  linear  programming 
algorithms  on  the  signal  representation  problems  described  below.  The  algorithms  we  compared 
were  GBP,  two  variants  of  the  simplex  method,  and  an  interior  point  method. 

The implementation of GBP used was our own, written entirely in Matlab. The linear programming
solvers used were those included in the Matlab Optimization Toolbox 3.0 [4], and a
freely  available  Matlab  implementation  [75]  of  the  revised  simplex  method  [27].  The  Optimization 
Toolbox  version  of  the  simplex  method  is  the  classical  simplex  method  [28],  with  the  initial  basis 
determined  as  in  [11].  The  Optimization  Toolbox  version  of  the  interior  point  method  is  essentially 
LIPSOL  [106],  a  freely  available  interior  point  solver  that  implements  Mehrotra’s  predictor-corrector 
method  [73,  70]. 

For each problem, all algorithms were run and timed. All algorithms were run under Matlab 7
on  a  1.5GHz  Pentium  M  processor  running  Windows  XP,  with  1.25GB  memory.  On  all  problems 
all  algorithms  returned  identical  representations  (up  to  the  specified  error  tolerance). 

5.1  Running  times:  Random  data 

The  random  data  set  consisted  of  3000  randomly  generated  signal  representation  problems,  varying 
both  the  dimension  of  the  signal  space  and  the  overcompleteness  of  the  dictionary.  Each  problem 
consisted  of  a  randomly  generated  signal  and  a  randomly  generated  dictionary.  The  dimension  d  of 
the  problems  varied  over  the  set  {8, 16,  32,  64, 128,  256}.  The  overcompleteness  k  of  the  dictionaries 
varied over the set {2, 4, 8, 16, 32}. In each problem, the signal $x$ was randomly generated to be
uniformly distributed on the unit hypersphere in $\mathbb{R}^d$. The dictionary for each problem had $2kd$
atoms; the first $kd$ of these atoms were generated in the same fashion as the signal, the second $kd$
atoms were the negatives of the first $kd$ atoms. Additionally, the dictionary of each problem was
perturbed: to each atom was added Gaussian noise with variance 0.000001, after which the atom
was normalized to lie on the unit hypersphere; this perturbation ensures that the linear programming
algorithms can compute the requisite matrix inverses; for structured dictionaries this perturbation
also ensures that the atoms are in general position. For each $d$-$k$ pair, 100 problems were generated.
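
A problem of this kind could be generated roughly as follows. This is a hedged sketch rather than the authors' generator: the function names and the seed handling are assumptions, uniform points on the hypersphere are obtained by normalizing Gaussian vectors, and the noise variance is assumed to apply independently to each coordinate.

```python
import math
import random


def unit(vec):
    """Scale a vector to unit l2 norm."""
    norm = math.sqrt(sum(c * c for c in vec))
    return [c / norm for c in vec]


def random_unit_vector(d, rng):
    """Uniform point on the unit hypersphere, via a normalized Gaussian vector."""
    return unit([rng.gauss(0.0, 1.0) for _ in range(d)])


def random_problem(d, k, noise_var=1e-6, seed=0):
    """Return (signal, atoms) with 2*k*d perturbed, unit-norm atoms."""
    rng = random.Random(seed)
    x = random_unit_vector(d, rng)
    first = [random_unit_vector(d, rng) for _ in range(k * d)]
    atoms = first + [[-c for c in a] for a in first]   # second half: negatives of the first
    sigma = math.sqrt(noise_var)
    # Perturb every atom, then renormalize it onto the unit hypersphere.
    atoms = [unit([c + rng.gauss(0.0, sigma) for c in a]) for a in atoms]
    return x, atoms


x, atoms = random_problem(d=8, k=2)
```

Note that because the perturbation is applied after the negatives are formed, the two halves of the dictionary are only approximately negatives of each other.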

Figure 4 shows running times of the three algorithms as a function of overcompleteness for each
dimension; the curve plotted shows the mean running time of each algorithm over the 100 problems
of the specified dimension and overcompleteness, with error bars showing the corresponding
minimum and maximum running times. (We do not show the results of the revised simplex method
here, as it was outperformed by Matlab's simplex algorithm.)

5.2  Running  times:  Speech  data 

The  speech  data  set  consisted  of  100  signal  representation  problems.  Each  problem  consisted  of 
a  signal  randomly  drawn  from  the  TIMIT  database  [53]  and  an  overcomplete  multiscale  Gabor 
dictionary. 

Each signal comprised 256 samples ($d = 256$) and was randomly selected from the 'Train' subset

Figure 4: Running times for GBP, LIPSOL, and the simplex method (Matlab) on the random data
set, plotted as a function of overcompleteness for each dimension. Note that GBP's performance
improves relative to the other methods as the dimensionality of the problems increases.

Figure 5: Four signals drawn from the speech data set. Each signal consists of 256 samples.


Figure  6:  Four  atoms  drawn  from  the  multiscale  Gabor  cosine  dictionary. 

of  the  TIMIT  database.  The  signals  were  mean  centered  and  normalized.  Some  samples  are  shown 
in  Figure  5. 

The dictionary used was a 9× overcomplete multiscale Gabor dictionary (4608 atoms). The
dictionary consisted of several fixed scale critically sampled cosine Gabor bases. Each atom was
defined by the parameters $t_0$ and $f$ as $G(t; t_0, f) = (1/(2\pi\sigma)) \exp(-(t - t_0)^2/\sigma^2) \cos(2\pi f (t - t_0))$, where
$t_0 \in \{0 : \Delta t : 1\}$ and $f \in \{0 : \Delta f : d/2\}$, with $\Delta t = 2^j/d$, $\sigma = \sqrt{\pi/2}/\Delta t$, and $\Delta f = \sigma/\sqrt{2\pi}$;
the scale parameter $j$ varied over $\{0, 1, \ldots, 8\}$. See [52, 45] for details and other sampling schemes.
Once the atoms were defined, they were perturbed as in the random data case. Some samples from
the final dictionary are shown in Figure 6.

We  show  the  running  times  of  GBP,  LIPSOL,  and  the  revised  simplex  method  on  the  sound 
data  set  in  Table  1.  (The  revised  simplex  method  outperformed  Matlab’s  simplex  method.)  We 
show the mean, minimum, and maximum running times for each algorithm on the 100 signals.

5.3  Running  times:  Seismic  data 

The seismic data set consists of 100 signal representation problems. Each problem consists of a
256-sample signal of seismic recordings from the North Sea, 4 times downsampled from the original
data [94]; some samples are shown in Figure 7. The dictionary used was the same as used in the

Algorithm          Min        Mean       Max

GBP                40.41      48.72      56.35
LIPSOL             58.62      75.97      155.56
Revised Simplex    441.66     1297.65    2700.51

Table 1: Running times of GBP, LIPSOL, and the revised simplex method on the sound data set,
in CPU seconds.


Figure  7:  Four  signals  drawn  from  the  seismic  data  set.  Each  signal  consists  of  256  samples. 

speech  experiment  above.  We  show  the  running  times  of  GBP,  LIPSOL,  and  the  revised  simplex 
method  on  the  seismic  data  set  in  Table  2. 
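
For a quick sense of scale, the mean running times reported in Tables 1 and 2 can be turned into ratios relative to GBP. The snippet below is only arithmetic on the published means; the dictionary layout and variable names are our own.

```python
# Mean running times in CPU seconds, copied from Tables 1 and 2.
mean_times = {
    "speech": {"GBP": 48.72, "LIPSOL": 75.97, "Revised Simplex": 1297.65},
    "seismic": {"GBP": 48.83, "LIPSOL": 70.36, "Revised Simplex": 2489.05},
}

# Ratio of each algorithm's mean time to GBP's mean time on the same data set.
ratio_to_gbp = {
    data_set: {alg: t / row["GBP"] for alg, t in row.items()}
    for data_set, row in mean_times.items()
}

for data_set, row in ratio_to_gbp.items():
    for alg, ratio in row.items():
        print(f"{data_set:8s} {alg:16s} {ratio:6.2f}x")
```

On these means LIPSOL takes roughly one and a half times as long as GBP on both data sets, while the revised simplex method is slower by more than an order of magnitude.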

5.4  Example:  Speech  signal 

Figure  8  provides  an  example  comparing  GBP  to  MP  and  OMP  on  a  1024-dimensional  signal  (Figure 
8,  top  left),  selected  from  the  TIMIT  speech  database  [53],  using  a  multiscale  Gabor  dictionary 
(n  =  22528),  similar  to  the  one  used  for  the  sound  data.  (Note  that  the  other  BP  methods  were 
unable  to  compute  representations  on  problems  of  this  size  in  our  environment.)  Examining  the 


Algorithm          Min        Mean       Max

GBP                42.29      48.83      55.05
LIPSOL             60.52      70.36      112.90
Revised Simplex    2233.45    2489.05    2831.59

Table 2: Running times of GBP, LIPSOL, and the revised simplex method on the seismic data set,
in CPU seconds.
approximation error of each algorithm as a function of iteration (Figure 8, top right), we observe
that while the approximation error of GBP decreases somewhat more slowly than that of MP
(note also that each iteration of GBP is more costly), the error of GBP does appear to decrease
approximately exponentially. Also note that the representation computed by GBP is sparser than
that of MP, though less sparse than that of OMP, as indicated by the sorted-amplitudes-curves and
the $\ell^1$-norm of the representations. The sorted-amplitudes-curves [66, 62] (Figure 8, bottom left)
are plots of the logarithm of the final coefficients, sorted in descending order; the rates of decrease
indicate the relative sparsity of the representations. The $\ell^1$-norms of the representation coefficients
are 0.3274, 0.4569, and 0.4156 for GBP, MP, and OMP respectively. (Note that the results for
GBP  would  be  the  same  as  those  for  standard  linear  programming  methods  for  BP.)  The  feature 
of  GBP  to  note  here  is  its  ‘greediness’:  the  coefficients  in  the  order  of  atom  selection  track  the 
sorted-amplitudes-curve,  that  is,  GBP  tends  to  select  significant  atoms  early  on  (Figure  8,  bottom 
right).  This  demonstrates  that  it  is  possible  to  compute  Basis  Pursuit  signal  representations  and 
to  be  greedy  at  the  same  time. 
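
The two sparsity diagnostics used above are simple to compute from a coefficient vector. The following sketch is illustrative only: the function names are ours, the logarithm is taken of coefficient magnitudes with exact zeros dropped (an assumption, since the text does not say how zeros are handled), and the coefficient values are made up.

```python
import math


def l1_norm(coeffs):
    """Sum of absolute coefficient values."""
    return sum(abs(c) for c in coeffs)


def sorted_amplitudes_curve(coeffs):
    """Log magnitudes of the nonzero coefficients, in descending order."""
    mags = sorted((abs(c) for c in coeffs if c != 0.0), reverse=True)
    return [math.log(m) for m in mags]


coeffs = [0.5, -0.25, 0.0, 0.125]                      # hypothetical final coefficients
norm = l1_norm(coeffs)
curve = sorted_amplitudes_curve(coeffs)
```

A curve that falls off quickly indicates that a few atoms carry most of the representation, which is the sense of sparsity compared in Figure 8.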

6  Discussion 

Our  results  show  that  GBP  provides  a  fast  alternative  to  standard  linear  programming  methods  for 
sparse  signal  representation  problems,  particularly  when  the  dimension  of  the  signal  space  is  high 
and  the  dictionary  is  very  overcomplete.  While  there  are  a  variety  of  factors  which  may  contribute 
to the results, there are several algorithmic reasons why we expect GBP to perform relatively well.

The  efficient  solution  of  linear  programming  problems  depends  in  a  complicated  way  on  the 
problem,  the  method  of  solution  and  its  implementation,  and  the  available  resources;  see  Bixby  [12]. 
Thus  the  relative  success  of  GBP  compared  to  the  linear  programming  methods  implemented  in 
the  Matlab  Optimization  Toolbox  is  partially  a  function  of  the  specific  methods  used  and  their 
implementation. There are many available linear programming packages [48]; some specific to sparse
representation include Atomizer [1], $\ell_1$-MAGIC [2], and SparseLab [3]. An exhaustive comparison
of  GBP  against  all  of  these  methods  is  out  of  the  scope  of  the  present  paper.  However,  while 


Figure 8: An example comparing GBP to MP on a speech signal (a). (b) The log of the error
($\ell^2$-norm of the residual) as a function of iteration. (c) The sorted-amplitudes-curves; observe that
GBP  produces  a  sparser  representation  than  MP.  (d)  The  (final)  coefficient  values,  in  order  of 
atom  selection.  (Note  that  the  coefficient  values  change  in  GBP  at  each  iteration.)  See  text  for 
discussion. 

the  linear  programming  methods  against  which  GBP  was  compared  may  not  represent  the  current 
state-of-the-art,  it  is  worth  noting  that  GBP  itself  has  the  potential  for  significant  speed  increases 
through  more  efficient  implementation. 

Algorithmically,  GBP  has  several  advantages  over  standard  linear  program  solvers.  First,  most 
linear  program  solvers  assume,  for  historical  reasons,  that  the  constraint  matrix  is  sparse,  and  they 
therefore  rely  on  techniques  that  exploit  this  sparsity,  whether  or  not  sparsity  is  actually  present  [22]. 
The  signal  representation  problems  considered  here  are  not  particularly  sparse  in  this  sense:  the 
random  dictionary  matrices  used  in  the  random  data  set  are  certainly  not  sparse,  while  the  Gabor 
dictionary matrices used with the sound and seismic data sets are somewhat sparse; however, the
matrices are not sufficiently structured for certain fast algorithms to be applicable [20, 21]. GBP
does  not  exploit  this  sparsity,  and  therefore  does  not  suffer  when  it  is  not  present.  Second,  GBP 
is  efficient  in  the  search  for  the  next  atom  to  select,  because  this  search  is  based  on  a  geometric 
criterion  that  involves  2  projections  per  possible  atom.  Simplex  methods  can  be  inefficient  at  this 
task  as  the  search  can  involve  evaluating  more  than  2,  even  d,  projections  per  possible  atom;  see 
[104,  101].  Third,  the  updates  in  GBP  are  seldom  of  a  full  basis,  further  reducing  computation. 
Finally,  the  complexity  of  the  simplex  method  depends  on  the  closeness  of  the  initial  solution  to 
the  optimal  solution,  which  in  turn  depends  on  the  phase  I  algorithm  by  which  the  initial  basis  is 
selected. GBP does not depend on an initial solution; in fact, GBP can be interpreted as a combined
phase I/phase II linear programming algorithm.

One  area  which  we  have  not  explored  that  merits  further  investigation  is  the  dependence  of 
the  performance  of  GBP  (and  other  sparse  representation  algorithms)  on  the  structure  of  the 
dictionary.  For  example,  a  dictionary  optimized  for  use  with  MP  [30]  or  OMP  [44]  may  well  have 
very  different  properties  from  one  optimized  for  BP.  The  design  of  dictionaries  has  only  recently 
received  attention  in  the  signal  processing  community  [30,  44,  5]  (for  work  in  neural  computation, 
see  [80,  69]);  our  work  suggests  that  the  geometric  properties  of  dictionaries  play  a  crucial  role  in 
both  the  efficiency  of  representation  algorithms  and  the  quality  of  the  resulting  representations. 
Indeed,  geometric  considerations  have  already  led  to  a  better  theoretical  understanding  of  sparse 
signal  representation  [39,  38]. 

As noted, part of the motivation for the development of BP is the observation that MP and
OMP can fail to find sparse, in the $\ell^0$-norm sense, signal representations [21], with much theoretical
work showing under exactly what conditions BP finds sparse representations, i.e., when the minimal
$\ell^1$-norm solution is equivalent to the minimal $\ell^0$-norm solution [37, 36, 51]. These findings have
made BP useful for areas beyond signal representation, including compressed sensing [35] and error
correcting codes [16]; thus GBP may prove useful in these domains.

7 Conclusions


We  have  described  GBP,  a  new  algorithm  for  Basis  Pursuit,  and  demonstrated  that  it  is  faster 
than  standard  linear  programming  methods  on  some  problems,  particularly  in  high-dimensional 
signal  spaces  using  very  overcomplete  dictionaries.  A  Matlab  implementation  of  GBP  is  currently 
available online at: http://www.cs.yale.edu/~huggins/gbp.html

Computational  geometry  has  traditionally  been  the  preserve  of  computer  science,  particularly 
computer  graphics  and  theoretical  computer  science;  its  use  here  in  the  development  of  GBP 
highlights  the  relevance  of  computational  geometry  to  signal  processing.  GBP  also  illustrates  the 
interplay  between  signal  processing  and  linear  programming.  That  an  efficient  linear  programming 
algorithm falls naturally out of sparse signal representation is surprising, and suggests that researchers
in signal processing should not view linear programming, or optimization in general, as
a  black  box:  on  one  hand  signal  processing  naturally  defines  a  set  of  problems  that  can  serve  to 
drive  research  in  linear  programming,  on  the  other  hand,  given  the  historical  parallels,  optimization 
research  deserves  deeper  examination  by  the  signal  processing  community. 

Abstract

Data  fusion  and  multi-cue  data  matching  are  fundamental  tasks  in  high  dimensional  data  analysis.  In  this 
paper,  we  apply  the  recently  introduced  diffusion  framework  to  address  these  tasks.  Our  contribution  is  three-fold. 

First,  we  present  the  Laplace-Beltrami  approach  to  computing  density  invariant  embeddings  which  are  essential 
for integrating different sources of data. Second, we describe a refinement of the Nyström extension algorithm
called “geometric harmonics”. We also explain how to use this tool for data assimilation. Finally, we introduce a
multi-cue data matching scheme based on nonlinear spectral graph alignment.

The effectiveness of the proposed scheme is validated by applying it to lip reading and image sequence
alignment.

I. Introduction

The processing of massive high-dimensional data sets is a contemporary challenge. Suppose that a
source $s$ produces high-dimensional data $\{x_1, \ldots, x_n\}$ that we wish to analyze. For instance, each data
point could be a frame of a movie produced by a digital camera, or a pixel of a hyperspectral image.
When  dealing  with  this  type  of  data,  the  high-dimensionality  is  an  obstacle  for  any  efficient  processing 
of  the  data.  Indeed,  many  classical  data  processing  algorithms  have  a  computational  complexity  that 
grows  exponentially  with  the  dimension  (this  is  the  so-called  “curse  of  dimensionality”).  On  the  other 
hand,  the  source  s  may  have  a  limited  number  of  degrees  of  freedom.  In  this  case,  the  high  dimensional 
representation  of  the  data  is  an  unfortunate  (but  often  necessary)  artifact  of  the  choice  of  sensors  or  the 
acquisition  device.  This  means  that  the  data  have  a  low  intrinsic  dimensionality,  or  equivalently,  that 
many of the variables that describe each data point are highly correlated, at least locally. Therefore it
is  possible  to  obtain  low-dimensional  representations  of  the  samples.  Note  that  since  the  variables  are 
correlated  only  locally,  classical  global  dimension  reduction  methods  like  Principal  Component  Analysis 
and  Multidimensional  Scaling  do  not  provide,  in  general,  an  efficient  dimension  reduction. 

First  introduced  in  the  context  of  manifold  learning,  eigenmaps  techniques  [1],  [2],  [3],  [4]  are  becoming 
increasingly  popular  as  they  overcome  this  problem.  Indeed,  they  perform  a  nonlinear  reduction  of  the 
dimension  by  providing  a  parametrization  of  the  data  set  that  preserves  neighborhoods.  However,  the  new 
representation  that  one  obtains  is  highly  sensitive  to  the  way  the  data  points  were  originally  sampled.  More 
precisely,  if  the  data  are  assumed  to  approximately  lie  on  a  manifold,  then  the  eigenmap  representation 
depends  on  the  density  of  the  points  on  this  manifold  [5].  This  issue  is  of  critical  importance  in  applications 
as  one  often  needs  to  merge  data  that  were  produced  by  the  same  source  but  acquired  with  different  devices 
or  sensors,  at  various  sampling  rates  and  possibly  on  different  occasions.  In  that  case,  it  is  necessary  to  have 
a  canonical  representation  of  the  data  that  retains  the  intrinsic  constraints  of  the  samples  (e.g.  manifold 
geometry)  regardless  of  the  particular  distribution  of  the  datasets  sampled  by  different  devices. 

Another  important  issue  is  that  of  data  matching.  This  question  arises  when  one  needs  to  establish  a 
correspondence  between  two  data  sets  resulting  from  the  same  fundamental  source.  For  instance,  consider 
the  problem  of  matching  pixels  of  a  stereo  image  pair.  One  can  form  a  graph  for  each  image,  where  pixels 
constitute the nodes, and where edges are weighted according to the local features in the image. The
problem now boils down to matching nodes between two graphs. Note that this situation is an instance of
the multi-sensor integration problem, in which one needs to find the correspondence between data captured by
different  sensors.  In  some  applications,  like  fraud  detection,  synchronizing  data  sets  is  used  for  detecting 
discrepancies  rather  than  similarities  between  data  sets. 

The  out-of-sample  extension  problem  is  another  aspect  of  the  data  fusion  problem.  The  idea  is  to 
extend  a  function  known  on  a  training  set  to  a  new  point  using  both  the  target  function  and  the  geometry 
of the training domain. The new point and the corresponding value of the function can then be assimilated
to the training set. This is an essential component in any scheme that agglomerates knowledge over an
initial data set and then applies the inferred structure to new data. Recently, Belkin et al. have developed
a solution to this problem via the concept of manifold regularization [6]. Earlier, several authors used
the Nyström extension procedure in the machine learning context [7], [8] in order to extend eigenmap
coordinates. In both cases, the question of the scale of the extension kernel remains unanswered. In other
words, given an empirical function on a data set, to what distance from the training set can this function
be extended? In particular, given the spectral embedding of the data set, which kernel should be used to
extend it?

By  relating  the  frequency  content  of  the  target  function  on  the  training  set  to  the  extrinsic  Fourier 
analysis, Coifman et al. provide an answer to this question [9]. They developed the idea of “geometric
harmonics” based on the Nyström extension at different scales, providing a multiscale extension scheme
for empirical functions. We apply this concept to the extension of spectral embeddings and show that
the extension has to be conducted using a specially designed kernel which differs from the data embedding
kernel.

In  this  article,  we  show  that  the  questions  discussed  above  can  be  efficiently  addressed  by  the  general 
diffusion  framework  introduced  in  [10],  [11].  The  main  idea  is  that,  just  like  for  eigenmaps  methods, 
eigenvectors  of  Markov  matrices  can  be  used  to  embed  any  graph  into  a  Euclidean  space  and  achieve 
dimension reduction. Building on these ideas, the contribution of this paper is three-fold: first, we show
that by carefully normalizing the Markov matrix, the embedding can be made invariant to the density of the
sampled  data  points,  thus  solving  the  problem  of  data  fusion  encountered  with  other  eigenmaps  methods. 
Then,  we  address  the  problem  of  out-of-sample  extension,  and  we  explain  how  to  extend  empirical 
functions  to  new  samples  using  the  geometric  harmonics.  Last,  we  take  advantage  of  the  density-invariant 
representation  of  data  sets  provided  by  the  diffusion  coordinates  to  derive  a  simple  data  matching  algorithm 
based  on  geometrical  embeddings  alignment. 

The  proposed  scheme  is  experimentally  verified  by  applying  it  to  visual  data  analysis.  First,  we  address 
the problem of automatic lip reading by embedding the lip images using the Laplace-Beltrami normalization and deriving
an  automatic  lip  reading  scheme  where  new  data  is  assimilated  using  geometric  harmonics.  Second,  we 
demonstrate  the  multi-cue  data  matching  aspect  of  our  work  by  matching  image  sequences  corresponding 
to  similar  head  motions. 

This  paper  is  organized  as  follows:  we  start  by  recalling  the  diffusion  framework,  and  the  notion  of 
diffusion  maps  in  Section  II.  We  then  explain  in  Section  II-B  how  to  normalize  the  diffusion  kernel  in 
order  to  separate  the  geometry  (constraints)  of  the  data  from  the  distribution  of  the  points.  We  describe  the 
out-of-sample extension procedure via the geometric harmonics in Section II-C and present a nonlinear
algorithm for matching two data sets in Section II-D. Last, we illustrate these ideas by applying them to lip reading and
sequence alignment in Section III.

II.  The  diffusion  framework 
A.  Diffusion  maps  and  diffusion  distances 

Let $\Omega = \{x_1, \ldots, x_n\}$ be n data points. In this section, we recall the diffusion framework as described
in  [5],  [12].  The  main  point  of  this  set  of  techniques  is  to  introduce  a  useful  metric  on  data  sets  based 
on  the  connectivity  of  points  within  the  graph  of  the  data,  and  also  to  provide  coordinates  on  the  data  set 
that  reorganize  the  points  according  to  this  metric. 

The  first  step  in  our  construction  is  to  view  these  points  as  being  the  nodes  of  a  symmetric  graph  in 
which two nodes are connected by an edge if they are very similar. The very notion of similarity between
two data points is completely application-driven. In many situations, each data point is a collection of
numerical measurements and can be thought of as a point in a Euclidean feature space. In this case,
similarity is measured in terms of closeness in this space, and it is customary to weight the edge between $x_i$
and $x_j$ by $\exp(-\|x_i - x_j\|^2/\varepsilon)$, where $\varepsilon > 0$ is a scale parameter. More generally, we allow ourselves to
consider arbitrary weight functions that verify the following two conditions, for all x and y in $\Omega$:

• it is symmetric: $w(x,y) = w(y,x)$,

• it is pointwise non-negative: $w(x,y) \ge 0$.

The  weight  function  or  kernel  describes  the  first-order  interaction  between  the  data  points  as  it  defines 
the  nearest  neighbor  structures  in  the  graph.  The  analysis  of  the  data  provided  by  the  diffusion  techniques 
depends  heavily  on  the  choice  of  the  weight  function. 

Following a classical construction in spectral graph theory [13], namely the normalized graph Laplacian,
we now create a random walk on the data set $\Omega$ by forming the following kernel:

$$p_1(x,y) = \frac{w(x,y)}{d(x)},$$

where $d(x) = \sum_{z \in \Omega} w(x,z)$ is the degree of node x.

Since we have that $p_1(x,y) \ge 0$ and $\sum_{y \in \Omega} p_1(x,y) = 1$, the quantity $p_1(x,y)$ can be interpreted as the
probability that a random walker jumps from x to y in a single time step. If P is the $n \times n$ transition
matrix of this Markov chain, then taking powers of this matrix amounts to running the chain forward
in time. Let $p_t(\cdot,\cdot)$ be the kernel corresponding to the t-th power of the matrix P. In other words, $p_t(\cdot,\cdot)$
describes the probabilities of transition in t time steps.
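As a minimal sketch of the construction above, the random walk and its t-step kernel can be formed as follows. The Gaussian weights, the synthetic point cloud, and the names `gaussian_weights` and `markov_matrix` are illustrative assumptions and do not come from the paper.

```python
import numpy as np

def gaussian_weights(X, eps):
    """Weights w(x_i, x_j) = exp(-||x_i - x_j||^2 / eps): symmetric and non-negative."""
    sq = np.sum(X ** 2, axis=1)
    d2 = sq[:, None] + sq[None, :] - 2.0 * X @ X.T
    np.maximum(d2, 0.0, out=d2)          # guard against tiny negative round-off
    return np.exp(-d2 / eps)

def markov_matrix(W):
    """Row-normalise the weights: p_1(x, y) = w(x, y) / d(x)."""
    d = W.sum(axis=1)                    # degree of each node
    return W / d[:, None]

# Hypothetical data: 50 points in R^3 (an assumption for illustration only).
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))
W = gaussian_weights(X, eps=1.0)
P = markov_matrix(W)
P_t = np.linalg.matrix_power(P, 8)       # kernel p_t for t = 8 time steps
```

Each row of `P_t` is the conditional distribution $p_t(x,\cdot)$ used in the remainder of this section.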

The asymptotic behavior of this random walk has been used to find clusters in the data set [13], [14],
where the first non-constant eigenfunction is used as a classification function into two clusters. More
recently, using the other eigenfunctions was considered [15]. For $t = +\infty$, this Markov chain is governed
by a unique stationary distribution $\phi_0$, which means that

$$\lim_{t \to +\infty} p_t(x,y) = \phi_0(y).$$

The vector $\phi_0$ is the top left eigenvector of P, i.e., $\phi_0^T P = \phi_0^T$, and it can be checked that $\phi_0(y)$ is given
by

$$\phi_0(y) = \frac{d(y)}{\sum_{z \in \Omega} d(z)}.$$
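The check is short. It uses only the symmetry of w and the definition of the degree d, with $p_1(x,y) = w(x,y)/d(x)$:

$$\sum_{x \in \Omega} \phi_0(x)\, p_1(x,y) = \sum_{x \in \Omega} \frac{d(x)}{\sum_{z} d(z)} \cdot \frac{w(x,y)}{d(x)} = \frac{\sum_{x \in \Omega} w(y,x)}{\sum_{z} d(z)} = \frac{d(y)}{\sum_{z} d(z)} = \phi_0(y).$$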


It can be shown [12] that the pre-asymptotic regime is governed according to the following eigendecomposition

$$p_t(x,y) = \sum_{l \ge 0} \lambda_l^t\, \psi_l(x)\, \phi_l(y), \tag{1}$$

where $\{\lambda_l\}$ is the sequence of eigenvalues of P (with $|\lambda_0| \ge |\lambda_1| \ge \ldots$) and $\{\phi_l\}$ and $\{\psi_l\}$ are the
corresponding left and right eigenvectors (see the appendix for a proof). Furthermore, because of the
spectrum decay, only a few terms are needed to achieve a given relative accuracy $\delta > 0$ in the previous
sum. Let m(t) be this number.

Unifying ideas from Markov chains and potential theory, the diffusion distance between two points x
and z was introduced in [12], [5] as

$$D_t^2(x,z) = \sum_{y \in \Omega} \frac{(p_t(x,y) - p_t(z,y))^2}{\phi_0(y)}. \tag{2}$$

This quantity is simply a weighted $L^2$ distance between the conditional probabilities $p_t(x,\cdot)$ and $p_t(z,\cdot)$.
These probabilities can be thought of as features attached to the points x and z, and they measure the
influence  or  interaction  of  these  two  nodes  with  the  rest  of  the  graph.  If  one  increases  t,  one  propagates 
the  local  or  short-term  influence  of  each  node  to  its  nearest  neighbors,  and  this  means  that  t  also  plays 
the  role  of  a  scale  parameter.  The  comparison  of  these  conditional  probabilities  introduces  a  notion  of 
proximity  that  accounts  for  the  connectivity  of  the  points  in  the  graph.  In  particular,  unlike  the  shortest 
path,  or  geodesic  distance,  this  metric  is  robust  to  noise  as  it  involves  an  integration  along  all  paths  of 
length  t  starting  from  x  or  z. 

The connection between the diffusion distance and the eigenvectors goes as follows (see appendix):

$$D_t^2(x,z) = \sum_{l \ge 1} \lambda_l^{2t} (\psi_l(x) - \psi_l(z))^2. \tag{3}$$

Note that $\psi_0$ does not appear in the sum because it is constant. This identity means that the right
eigenvectors can be used to compute the diffusion distance. Furthermore, because of the spectrum decay,
only a few terms are needed to achieve a given relative accuracy $\delta > 0$ in the previous sum. Let m(t) be
this number, and define the diffusion map

$$\Psi_t : x \longmapsto \begin{pmatrix} \lambda_1^t \psi_1(x) \\ \lambda_2^t \psi_2(x) \\ \vdots \\ \lambda_{m(t)}^t \psi_{m(t)}(x) \end{pmatrix}. \tag{4}$$


This mapping provides coordinates on the data set $\Omega$, and embeds the n data points into the Euclidean
space $\mathbb{R}^{m(t)}$. In addition, the spectrum decay is the reason why dimension reduction can be achieved.
This  method  constitutes  a  universal  and  data-driven  way  to  represent  a  graph  or  any  generic  data  set  as  a 
cloud  of  points  in  a  Euclidean  space.  We  also  obtain  a  complete  parametrization  of  the  data  that  captures 
relevant  modes  of  variability.  Moreover,  the  dimension  m(t)  of  the  new  representation  only  depends  on  the 
properties  of  the  random  walk  on  the  data,  and  not  on  the  number  of  features  of  the  original  representation 
of the data. In particular, if we increase t, then m(t) decreases and we capture the large scale geometry
of  the  data. 
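The following sketch computes a diffusion map and numerically compares the two expressions for the diffusion distance, the definition via transition probabilities and the spectral identity. It assumes Gaussian weights on synthetic points. The function name `diffusion_map`, and the choice to obtain the eigenvectors of P through the symmetric conjugate matrix $D^{-1/2} W D^{-1/2}$, are implementation choices made here, not a procedure stated in the paper.

```python
import numpy as np

def diffusion_map(W, t, m):
    """Return (Psi, lam, psi): the m-dimensional diffusion map at time t,
    plus all eigenvalues and right eigenvectors of P = D^{-1} W."""
    d = W.sum(axis=1)
    # P is conjugate to the symmetric matrix A = D^{-1/2} W D^{-1/2},
    # so a symmetric eigensolver yields real eigenvalues and orthonormal v_l.
    A = W / np.sqrt(np.outer(d, d))
    lam, V = np.linalg.eigh(A)
    order = np.argsort(-np.abs(lam))            # |lam_0| >= |lam_1| >= ...
    lam, V = lam[order], V[:, order]
    phi0 = d / d.sum()                          # stationary distribution
    psi = V / np.sqrt(phi0)[:, None]            # right eigenvectors; psi_0 is constant
    Psi = psi[:, 1:m + 1] * lam[1:m + 1] ** t   # drop psi_0, scale by lam_l^t
    return Psi, lam, psi

# Synthetic check using all n - 1 non-trivial coordinates (illustrative data).
rng = np.random.default_rng(0)
X = rng.normal(size=(30, 2))
D2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=-1)
W = np.exp(-D2 / 1.0)
t = 3
Psi, lam, psi = diffusion_map(W, t, m=len(X) - 1)

P_t = np.linalg.matrix_power(W / W.sum(axis=1, keepdims=True), t)
phi0 = W.sum(axis=1) / W.sum()
direct = np.sum((P_t[0] - P_t[1]) ** 2 / phi0)  # weighted L2 distance of p_t(x,.) and p_t(z,.)
spectral = np.sum((Psi[0] - Psi[1]) ** 2)       # squared Euclidean distance in the embedding
```

In practice one would keep only the first m(t) coordinates; the full set is used here only so that the two quantities agree up to round-off.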


B.  Data  merging  using  the  Laplace-Beltrami  normalization 

We now direct our attention to the case when the original data points $\Omega = \{x_1, \ldots, x_n\}$ are assumed to
approximately lie on a submanifold $\mathcal{M}$ of $\mathbb{R}^d$. The so-called “manifold model” holds for a large variety
of  situations,  such  as  when  the  data  is  produced  by  a  source  controlled  by  a  few  free  parameters.  For 
instance,  consider  the  rotation  of  a  human  head  and  the  lips  motion  of  a  speaker.  We  will  study  these 
examples  later  in  this  paper. 

On the manifold $\mathcal{M}$, the data points were sampled with a density $q(\cdot)$ that may reflect some important
aspect of the phenomenon that generated the data. For instance, as described in [12], for some data sets,
the density is related to the free energy surface that governs the samples. On the other hand, the density
may depend on the acquisition process and may be unrelated to the intrinsic geometry or dynamics of the
underlying  phenomenon.  In  this  situation,  the  distribution  of  the  points  is  an  artifact  of  the  sampling 
process,  and  consequently,  any  “good”  representation  of  the  data  should  be  invariant  to  the  density. 

Classical  eigenmap  methods  provide  an  embedding  that  combines  the  information  of  both  the  density 
and  geometry.  For  instance,  with  the  Laplacian  eigenmaps  [2],  one  starts  by  forming  the  graph  with 
Gaussian weights $w_\varepsilon(x,y) = \exp(-\|x-y\|^2/\varepsilon)$, and then constructs the random walk as described in
the previous section. The eigenvectors are then used to embed the data set into a Euclidean space. It was
shown in [10] that in the large sample limit $n \to +\infty$ and small scale $\varepsilon \to 0$, the eigenvectors tend to
those of the Schrödinger operator $\Delta + E$, where $\Delta$ is the Laplace-Beltrami operator on $\mathcal{M}$, and E is a
scalar potential that depends on the density q. As a consequence, the Laplacian eigenmaps representation
of  the  data  heavily  depends  on  the  density  of  the  data  points.  In  particular,  it  makes  it  impossible  to  fuse 
two  data  sets  obtained  from  the  same  sensors  but  with  different  densities. 

In order to solve this problem, we suggest renormalizing the Gaussian edge weights $w_\varepsilon(\cdot,\cdot)$ with an
estimate of the density and forming the random walk on this new graph. This is summarized in Algorithm
1.

Let $P_\varepsilon$ be the transition matrix with entries $p_\varepsilon(\cdot,\cdot)$. The asymptotics for $P_\varepsilon$ are given in the following
theorem.

Theorem 1: In the limit of large sample and small scales, we have

$$\lim_{\varepsilon \to 0} \lim_{n \to +\infty} \frac{I - P_\varepsilon}{\varepsilon} = \Delta.$$

In particular, the eigenvectors of $P_\varepsilon$ tend to those of the Laplace-Beltrami operator on $\mathcal{M}$. We refer to [5]
for a proof. This result shows that the diffusion embedding that one obtains from an appropriately renormalized
Gaussian kernel does not depend on the density q of the data points on $\mathcal{M}$. This algorithm allows one
to successfully capture the nonlinear constraints governing the data, independently of the distribution
of the points. In other words, it separates the geometry of the manifold from the density.
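Algorithm 1 Approximation of the Laplace-Beltrami diffusion
1: Start with a rotation-invariant kernel $w_\varepsilon(x,y) = h\left(\frac{\|x-y\|^2}{\varepsilon}\right)$.

2: Let

$$q_\varepsilon(x) = \sum_{y \in \Omega} w_\varepsilon(x,y),$$

and form the new kernel

$$\tilde{w}_\varepsilon(x,y) = \frac{w_\varepsilon(x,y)}{q_\varepsilon(x)\, q_\varepsilon(y)}.$$

3: Apply the normalized graph Laplacian construction to this kernel, i.e., set

$$d_\varepsilon(x) = \sum_{z \in \Omega} \tilde{w}_\varepsilon(x,z),$$

and define the anisotropic transition kernel

$$p_\varepsilon(x,y) = \frac{\tilde{w}_\varepsilon(x,y)}{d_\varepsilon(x)}. \tag{5}$$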
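The three steps of Algorithm 1 can be sketched in a few lines. This is an illustration under stated assumptions rather than the authors' implementation: the profile is taken to be $h(u) = e^{-u}$, the name `laplace_beltrami_markov` is hypothetical, and the test data are points sampled non-uniformly on a circle.

```python
import numpy as np

def laplace_beltrami_markov(X, eps):
    """Density-normalised transition matrix following the three steps above."""
    sq = np.sum(X ** 2, axis=1)
    d2 = np.maximum(sq[:, None] + sq[None, :] - 2.0 * X @ X.T, 0.0)
    W = np.exp(-d2 / eps)                 # step 1, with h(u) = exp(-u) (assumed)
    q = W.sum(axis=1)                     # step 2: kernel density estimate q_eps
    W_tilde = W / np.outer(q, q)          #         renormalised kernel
    d = W_tilde.sum(axis=1)               # step 3: degrees of the new graph
    return W_tilde / d[:, None], W_tilde  #         row-stochastic kernel p_eps

# Non-uniformly sampled points on a circle (a toy one-dimensional manifold).
rng = np.random.default_rng(1)
theta = np.sort(rng.beta(2.0, 5.0, size=200)) * 2.0 * np.pi
X = np.column_stack([np.cos(theta), np.sin(theta)])
P_eps, W_tilde = laplace_beltrami_markov(X, eps=0.05)
```

The eigenvectors of `P_eps` can then be used to build the diffusion map exactly as for the unnormalized random walk.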


C.  Out-of-sample  extension  and  the  geometric  harmonics 

In  most  applications,  it  is  essential  to  be  able  to  extend  the  low  dimensional  representation  computed 
on a training set to new samples. Let $\Omega$ be a data set and $\Psi_t$ be its diffusion embedding map. We now
present the geometric harmonic scheme that allows us to extend $\Psi_t$ to a new data set $\bar{\Omega}$. Since we need
to relate the new samples to the training set, we will assume that $\bar{\Omega}$ is a subset of a Euclidean space $\mathbb{R}^d$.

The focal point of our extension scheme is the distinction between the embedding kernel $w_\varepsilon$ used to
compute $\Psi_t$ on $\Omega$ and the extension kernel $k_\sigma$ used to extend $\Psi_t$ onto the new data set $\bar{\Omega}$. It was shown
in [9] that the properties required for the extension kernel $k_\sigma$ are significantly different from those of
$w_\varepsilon$ and somewhat contradictory. In particular, when computing $\Psi_t$ one strives to use as small a scale $\sqrt{\varepsilon}$
as possible, whereas for the extension kernel $k_\sigma$ one would use a scale factor $\sigma$ as large as possible. The
geometric harmonic algorithm was first introduced in [9] and is based on the idea of using the Nyström
extension to expand the eigenvectors of the specially designed kernel $k_\sigma$ from $\Omega$ to $\bar{\Omega}$. These eigenvectors
form a basis that can be used to extend any function f given on $\Omega$ to $\bar{\Omega}$, and in particular the vector
function $\Psi_t$. In our application we used a Gaussian extension kernel $k_\sigma(x,y) = e^{-\|x-y\|^2/\sigma^2}$ to extend
$\Psi_t$ computed by the Laplace-Beltrami kernel given in Equation 5. In previous works [7], the Nyström
extension was used to extend $\Psi_t$ using the same kernel $w_\varepsilon$.

Next, we discuss the design of the extension kernel $k_\sigma$ and provide a scheme for its computation in
Algorithm 2. The design is based on finding an equilibrium between $\sigma$, the width of the extension kernel
$k_\sigma$, and the reconstruction error of the function f on $\Omega$ using only a subset of the eigenvectors of $k_\sigma$. On
the one hand, we aim to increase $\sigma$ as much as possible to maximize the extension range. But on the other
hand, as shown below, this also increases the reconstruction error of f. Hence, the reconstruction error
limits  the  maximal  extension  range.  In  fact,  this  limitation  can  be  regarded  as  relating  the  complexity  of 
the  function  on  the  training  set  to  the  distance  to  which  it  can  be  extended  off  this  set.  Here,  the  notion 
of  complexity  is  measured  in  terms  of  frequency  content  on  the  training  domain.  For  instance,  a  constant 
function  has  almost  no  complexity  and  one  should  be  able  to  extend  it  in  the  entire  space.  If  the  number 
of  oscillations  of  this  function  increases,  then  the  distance  to  which  one  can  extend  it  gets  smaller. 

We first recall the idea of Nyström extension [16]. Let $\sigma > 0$ be a scale of extension, and consider the
eigenvectors and eigenvalues of a Gaussian kernel of width $\sigma$ on the training set $\Omega$:

$$\mu_l \phi_l(x) = \sum_{y \in \Omega} e^{-\|x-y\|^2/\sigma^2} \phi_l(y) \quad \text{where } x \in \Omega.$$

Since the kernel can be evaluated in the entire space, it is possible to take any $x \in \mathbb{R}^d$ in the right-hand side
of this identity. This yields the following definition of the Nyström extension of $\phi_l$ from $\Omega$ to $\mathbb{R}^d$:

$$\bar{\phi}_l(x) = \frac{1}{\mu_l} \sum_{y \in \Omega} e^{-\|x-y\|^2/\sigma^2} \phi_l(y) \quad \text{where } x \in \mathbb{R}^d. \tag{6}$$

Numerically, we have extended $\phi_l$ only to a distance $\sigma$ from $\Omega$. Such an extension is termed “geometric
harmonic”. As the eigenvectors $\phi_l$ form an orthonormal set, an arbitrary function f can be extended from
$\Omega$ to $\mathbb{R}^d$ by expressing it as a linear combination of the geometric harmonics $\bar{\phi}_l$.

However, as can be seen from Equation 6 and from the fact that $\mu_l \to 0$, the extension of some functions
is an ill-posed linear operation. Indeed, the extension of the first $(l+1)$ eigenfunctions of the Gaussian
kernel has a condition number equal to $\mu_0/\mu_l$. The only way to control the conditioning of this procedure
is to perform regularization by retaining only the coefficients for which $\mu_0/\mu_l < \eta$ (where $\eta$ is a bound
on the condition number that plays the role of a regularization parameter):

$$f \approx \sum_{l \,:\, \mu_0 < \eta \mu_l} \langle \phi_l, f \rangle\, \phi_l.$$

This approximation generates an error in reconstructing f on $\Omega$. We therefore fix an admissible error
threshold $\tau > 0$, and check whether this error is smaller or larger than $\tau$. In the former case, the function
f has a low-frequency content and can safely be extended at scale $\sigma$. In the latter case, a non-negligible
energy is lost in high frequency coefficients, and f cannot be extended at scale $\sigma$. Consequently, the scale
$\sigma$ has to be reduced. A smaller $\sigma$ results in a slower decay of the eigenvalues $\mu_l$ and an improved condition
number $\eta$. These observations give rise to the multiscale extension scheme summarized in Algorithm 2.

D.  Multi-cue  alignment  and  data  matching 

The  purpose  of  this  section  is  to  explain  how  the  diffusion  embedding  can  be  efficiently  used  for  data 
matching. Suppose that one has two data sets $\Omega_1 = \{x_1, \ldots, x_n\}$ and $\Omega_2 = \{y_1, \ldots, y_{n'}\}$ for which one
would like to find a correspondence, or detect similar patterns and trends, or on the contrary, underline
their dissimilarity and detect anomalies. This type of task is very common in applications related to
marketing, fraud detection or even counter-terrorism. However, working with the data in its original form
can be quite difficult as the two sets typically consist of measurements of very different nature. For
instance, $\Omega_1$ could be a collection of measurements related to weather in a given region, whereas $\Omega_2$ could
describe agricultural production in the same region. As a consequence, it is almost always impossible
to directly compare the two data sets. The main idea that we introduce here is that the diffusion maps
provide a canonical representation of data sets. This new representation is based on the graph structure of
a set, which is often the relevant structure in the context of data matching. As a consequence, instead of
comparing the sets in their original forms, it can be much more efficient to compare their embeddings. In
particular, if $\Omega_1$ and $\Omega_2$ are expected to have similar structures, then they should have similar embeddings.
Algorithm 2 Multiscale extension scheme of diffusion coordinates via geometric harmonics
1: Let $\Omega \subset \bar{\Omega}$ be the training set and $\psi_i : \Omega \to \mathbb{R}$ be the diffusion coordinate to be extended ($1 \le i \le m(t)$),
and write $f = \psi_i$. Choose a condition number $\eta > 0$ and an admissible error $\tau > 0$.

2: Choose an initial (large) scale of extension $\sigma = \sigma_0$.

3: Compute the eigenfunctions of the Gaussian kernel with width $\sigma$ on the training set $\Omega$:

$$\mu_l \phi_l(x) = \sum_{y \in \Omega} e^{-\|x-y\|^2/\sigma^2} \phi_l(y) \quad \text{where } x \in \Omega,$$

and expand f on this orthonormal basis (on the training set $\Omega$):

$$f(x) = \sum_{l \ge 0} c_l \phi_l(x) \quad \text{where } x \in \Omega.$$

4: Compute the error of reconstruction on the training set that one obtains by retaining only the
coefficients such that $\eta > \mu_0/\mu_l$ in the sum above:

$$\mathrm{Err} = \sum_{l \,:\, \eta \le \mu_0/\mu_l} |c_l|^2.$$

If $\mathrm{Err} > \tau$ then divide $\sigma$ by 2 and go back to point 3. Otherwise continue.
5: For each l such that $\eta > \mu_0/\mu_l$, extend $\phi_l$ via the Nyström procedure:

$$\bar{\phi}_l(x) = \frac{1}{\mu_l} \sum_{y \in \Omega} e^{-\|x-y\|^2/\sigma^2} \phi_l(y) \quad \text{where } x \in \bar{\Omega},$$

and define the extension $\bar{f}$ of f to be

$$\bar{f}(x) = \sum_{l \,:\, \eta > \mu_0/\mu_l} c_l \bar{\phi}_l(x) \quad \text{where } x \in \bar{\Omega}.$$

We illustrate these ideas by presenting a semi-supervised algorithm for finding a one-to-one correspondence
between two data sets. The scheme we introduce consists in aligning two graphs in a nonlinear
fashion, based on a finite number of landmarks. More precisely, suppose that we have $k < n, n'$ landmarks
in each set, that is, a sequence of k pairs $(x_{\sigma(1)}, y_{\tau(1)}), \ldots, (x_{\sigma(k)}, y_{\tau(k)})$ for which there is a known
correspondence. This set of examples is the only prior information we use in the algorithm. We assume that
$x_{\sigma(1)} \ne x_{\sigma(2)} \ne \ldots \ne x_{\sigma(k)}$. The scheme given in Algorithm 3 computes a surjective function $g : \Omega_1 \to \Omega_2$
such that $g(x_{\sigma(1)}) = y_{\tau(1)}, \ldots, g(x_{\sigma(k)}) = y_{\tau(k)}$.

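Algorithm 3 Nonlinear graph alignment
1: Start with k landmarks $(x_{\sigma(1)}, y_{\tau(1)}), \ldots, (x_{\sigma(k)}, y_{\tau(k)})$.

2: Compute the diffusion embeddings $\{\Psi_t(x_1), \ldots, \Psi_t(x_n)\}$ and $\{\Phi_t(y_1), \ldots, \Phi_t(y_{n'})\}$ of $\Omega_1$ and $\Omega_2$, where
t is chosen so that at least $k - 1$ eigenvectors are retained.

3: Compute the affine function $f : \mathbb{R}^{k-1} \to \mathbb{R}^{k-1}$ that satisfies

$$f(\Psi_t(x_{\sigma(1)})) = \Phi_t(y_{\tau(1)}), \; \ldots, \; f(\Psi_t(x_{\sigma(k)})) = \Phi_t(y_{\tau(k)}).$$

4: Define the correspondence between $\Omega_1$ and $\Omega_2$ by

$$g(x_i) = \arg\min_{y \in \Omega_2} \{\|f(\Psi_t(x_i)) - \Phi_t(y)\|\},$$

where $x_i \in \Omega_1$.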
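Steps 3 and 4 can be sketched as follows, assuming the two embeddings have already been computed and truncated to exactly k - 1 coordinates. The function name `align_embeddings` and the synthetic embeddings (a point cloud and an affinely distorted, shuffled copy of it) are assumptions for illustration; real diffusion embeddings of two data sets would only match approximately.

```python
import numpy as np

def align_embeddings(E1, E2, src_idx, dst_idx):
    """E1: (n, k-1) embedding of the first set, E2: (n', k-1) embedding of the second.
    src_idx, dst_idx: indices of the k landmark pairs (sigma and tau in the text).
    Returns, for each row i of E1, the index of its matched row in E2."""
    k = len(src_idx)
    # Affine map f(u) = u @ A + b, written in homogeneous form [u, 1] @ M = v.
    U = np.hstack([E1[src_idx], np.ones((k, 1))])        # k x k system matrix
    M = np.linalg.solve(U, E2[dst_idx])                  # needs landmarks in general position
    mapped = np.hstack([E1, np.ones((len(E1), 1))]) @ M  # f applied to every embedded point
    # The nearest neighbour in the second embedding defines the correspondence g.
    d2 = ((mapped[:, None, :] - E2[None, :, :]) ** 2).sum(axis=-1)
    return d2.argmin(axis=1)

# Synthetic example with k = 3 landmarks, hence k - 1 = 2 coordinates.
rng = np.random.default_rng(2)
E1 = rng.normal(size=(40, 2))
A, b = np.array([[0.8, -0.3], [0.2, 1.1]]), np.array([0.5, -1.0])
perm = rng.permutation(40)
E2 = (E1 @ A + b)[perm]               # same cloud, affinely distorted and shuffled
truth = np.argsort(perm)              # truth[i] = row of E2 that corresponds to row i of E1
landmarks = np.array([0, 1, 2])
g = align_embeddings(E1, E2, landmarks, truth[landmarks])
```

Because k points in general position in $\mathbb{R}^{k-1}$ determine a unique affine map, the landmark constraints are satisfied exactly and the remaining points are matched by proximity.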


The  number  of  eigenvectors  used  for  the  alignment  is  directly  related  to  the  number  of  landmarks, 
which, in turn, represents the quantity of prior information for aligning. The larger the number of known
constraints  on  the  alignment,  the  larger  the  dimensionality  of  the  aligning  mapping.  This  observation  is 
consistent  with  the  fact  that  higher  order  eigenvectors  capture  finer  structures.  Note  also  that  the  linear 
function  that  we  use  for  aligning  induces  a  nonlinear  mapping  defined  on  lower  dimensional  embedding 
of  the  sets.  These  observations  pave  the  way  for  a  general  sampling  theory  for  data  sets.  Indeed,  the 
landmarks  can  be  regarded  as  forming  a  subsampling  of  the  original  data  sets.  This  subset  determines 
the  largest  (or  Nyquist)  frequency  used  to  represent  the  original  set.  This  frequency  is  measured  as  the 
number  of  eigenvectors  used. 
III. Experimental results


A.  Application  to  lip-reading 

The  validity  of  our  approach  is  now  demonstrated  by  applying  it  to  lip  reading  and  sequence  alignment, 
which  are  typical  high-dimensional  data  analysis  problems.  In  particular,  lip  reading  has  gained  significant 
attention  [17],  [18],  [19],  [20],  [21]  and  we  now  provide  background  and  previous  results  in  that  field. 
The ultimate goal of lip reading is to design human-like man-machine interfaces allowing automatic
comprehension of speech (which, in the absence of sound, is denoted lip-reading) and the synthesis of
realistic lip movement. The design of such a system involves three main challenges: first, the feature
extraction,  which  aims  at  converting  the  images  of  the  lips  into  a  useful  description,  must  be  achieved 
with  minimal  preprocessing.  Then,  in  order  to  be  efficiently  processed,  the  data  must  be  transformed  via 
a  dimension  reduction  technique.  Last,  in  order  to  assimilate  new  data  for  recognition,  one  must  be  able 
to  perform  data  fusion. 

Previous  schemes  have  mainly  focused  on  the  first  two  points.  Concerning  the  feature  extraction,  some 
works  [17],  [21]  analyze  directly  the  intensity  values  of  the  input  images,  while  others  [22],  [18]  start  by 
detecting  curves  and  points  of  interest  around  the  mouth  whose  locations  are  then  used  as  features.  The 
combination  of  audio-visual  cues  was  used  in  [23]  where  the  visual  cues  are  the  extracted  lip  contours 
which are tracked over time. We note that combining audio and visual cues is beyond the scope of this work and
is left for future work. Identifying, tracking and segmenting the lips is a difficult task, and possible
solutions include active contours [24], probabilistic models [25] and the combination of multiple visual
cues (shape, color and motion) [26], to name a few. In practice, one strives to use as simple a preprocessing
scheme as possible, and in our scheme we employ a simple stabilization scheme discussed below.

Regarding  the  dimensionality  reduction,  several  schemes  have  been  used.  Preliminary  work  employed 
linear algorithms such as the PCA and SVD subspace projections [22], [21]. For instance, Li et al. [21]
use  a  linear  PCA  scheme  similar  to  the  eigenfaces  approach  to  face  detection.  Recognition  is  performed 
by correlating an input sequence with the eigenfeatures obtained from PCA. More recent schemes [17]
utilize non-linear approaches such as the MDS [27]. Some of the techniques provide a general embedding
framework  for  lipreading  analysis  [17],  while  others  [21],  [18]  concentrate  on  a  particular  task  such  as 
phoneme  or  word  identification.  The  work  in  [28]  is  of  particular  interest,  since  it  is  one  of  the  first 
to  explicitly  formulate  the  lipreading  problem  as  a  “Manifold  Learning”  issue  and  tries  to  derive  the 
inherent  constraints  embedded  in  the  space  of  lip  configurations.  A  Hidden  Markov  Model  (HMM)  is 
used  to  model  a  small  number  of  words  (names  of  four  drinks)  which  define  the  Markov  states  and  the 
manifold.  The  HMM  is  then  used  to  recognize  the  drinks’  names  where  the  input  is  given  by  tracking 
the  outer  lips  contour  using  Active  Contours.  Utilizing  both  audio  and  visual  information  significantly 
decreased  the  error  rate,  especially  in  noisy  environments.  Kimmel  and  Aharon  [17]  applied  the  MDS 
scheme  to  visual  lips  representation,  analysis  and  synthesis.  A  set  of  lips  images  is  aligned  and  embedded 
in  a  two  dimensional  domain  which  is  then  sampled  uniformly  in  the  embedding  domain  to  achieve 
uniform  density.  The  pronunciation  of  each  word  is  defined  as  a  path  over  the  embedding  domain  and 
used  for  visual  speech  recognition,  by  path  matching.  Lips  motion  synthesis  is  derived  by  computing  the 
geodesic  path  over  the  embedding  domain,  where  the  start  and  end  point  are  given  as  input. 

Analysis  of  lip  data  constitutes  an  application  where  it  is  important  to  separate  the  set  of  nonlinear 
constraints  on  the  data  from  the  distribution  of  the  points.  As  an  illustration  of  the  Laplace-Beltrami 
normalization  as  well  as  the  out-of-sample  extension  scheme,  we  now  describe  an  elementary  experiment 
that  paves  the  way  to  building  automatic  lip-reading  machines,  and  more  generally,  machine  learning 
systems. 

We recorded a movie of the lips of a subject reading a text in English. The subject was then asked to
repeat each digit “zero”, “one”, ..., “nine” 40 times. Minimal preprocessing was applied to the recorded
sequence. More precisely, it was first converted from color to gray level. Moreover, using a marker placed
at the tip of the nose of the speaker during the recording, we were able to automatically crop each frame
to a rectangular area around the lips. Each of these new frames was then regarded as a point in $\mathbb{R}^{140 \times 110}$,
where 140 x 110 is the size of the cropped area.

The first part of the data set, consisting of approximately 5000 frames, corresponds to the speaker
reading the text. These points were used to form a graph with Gaussian weights $\exp(-\|x_i - x_j\|^2/\varepsilon)$ on
the edges, for an appropriately chosen scale $\varepsilon > 0$. The distance $\|x_i - x_j\|$ was simply calculated as the
Euclidean $L^2$ distance between frames i and j. We then renormalized the Gaussian weights using the
Laplace-Beltrami normalization described in Section II-B. By doing so, our analysis focused on viewing
the mouth as a constrained mechanical system. In order to obtain a low-dimensional parametrization of
these nonlinear constraints, we computed the diffusion coordinates on this new graph. The embedding in
the first 3 eigenfunctions is shown in Figure 1.


Fig.  1.  The  embedding  of  the  lip  data  into  the  top  3  diffusion  coordinates.  These  coordinates  essentially  capture  two  parameters:  one 
controlling  the  opening  of  the  mouth  and  the  other  measuring  the  portion  of  teeth  that  are  visible. 


The task we wanted to perform was word recognition on a small vocabulary. The example that we
considered was the identification of digits. Each word “zero”, “one”, ..., “nine” is typically a sequence
of 25 to 40 frames that we need to project into the diffusion space. In order to do so, we used the geometric
harmonic extension scheme presented in Section II-C to extend each diffusion coordinate to the frames
corresponding to the subject pronouncing the different digits. After this projection, each word can be
viewed as a trajectory in the diffusion space. The word recognition problem now amounts to identifying
trajectories in the diffusion space.

We can now build a classifier based on comparing a new trajectory to a collection of labelled trajectories
in a training set. We randomly selected 20 instances of each digit to form a training set, the remaining
20 being used as a testing set. In order to compare trajectories in the diffusion space, a metric is needed,
and we chose to use the Hausdorff distance between two sets $\Gamma_1$ and $\Gamma_2$, defined as

$$d_H(\Gamma_1, \Gamma_2) = \max\left\{ \max_{x_2 \in \Gamma_2} \min_{x_1 \in \Gamma_1} \|x_1 - x_2\|, \; \max_{x_1 \in \Gamma_1} \min_{x_2 \in \Gamma_2} \|x_1 - x_2\| \right\}.$$

Although this distance does not use the temporal information, it has the advantage of not being sensitive to
the choice of a parametrization or to the sampling density for either set $\Gamma_1$ or $\Gamma_2$. For a given trajectory
$\Gamma$ from the testing set, our classifier is a nearest-neighbor classifier for this metric, i.e., the class of $\Gamma$ is
decided to be that of the nearest trajectory (for $d_H$) in the training set. The performance of this classifier
averaged over 100 random trials is shown in Table I. In this case, the data set was embedded in 15
dimensions.
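
The nearest-neighbor rule above can be sketched in a few lines. This is only an illustration under stated assumptions, not the code used for the experiment: the function names `hausdorff` and `classify`, and the toy two-dimensional trajectories, are hypothetical stand-ins for word trajectories in the 15-dimensional diffusion space.

```python
import math

def hausdorff(traj_a, traj_b):
    """Symmetric Hausdorff distance between two point sets (lists of tuples)."""
    def directed(src, dst):
        # Largest distance from a point of src to its nearest point in dst.
        return max(min(math.dist(p, q) for q in dst) for p in src)
    return max(directed(traj_a, traj_b), directed(traj_b, traj_a))

def classify(test_traj, training_set):
    """Label of the training trajectory nearest to test_traj.

    training_set is a list of (label, trajectory) pairs.
    """
    label, _ = min(training_set, key=lambda item: hausdorff(test_traj, item[1]))
    return label

# Toy 2-D trajectories standing in for words in diffusion space (hypothetical data).
training = [
    ("one", [(0.0, 0.0), (1.0, 0.0), (2.0, 0.0)]),
    ("two", [(0.0, 5.0), (1.0, 5.0), (2.0, 5.0)]),
]
# Same path as "one", sampled more densely and slightly shifted.
query = [(0.0, 0.1), (0.5, 0.1), (1.0, 0.1), (1.5, 0.1), (2.0, 0.1)]
print(classify(query, training))
```

The query is sampled at a different density than the training trajectory, which illustrates why a set distance that ignores parametrization is convenient here.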


         “0”    “1”    “2”    “3”    “4”    “5”    “6”    “7”    “8”    “9”
zero     0.93   0      0      0.01   0      0      0.06   0      0      0
one      0      1      0      0      0      0      0      0      0      0
two      0.05   0      0.88   0.05   0.01   0      0.01   0      0      0
three    0.01   0      0.02   0.93   0      0      0.01   0.01   0.01   0.01
four     0      0      0.01   0.01   0.97   0      0      0.01   0      0
five     0      0      0      0.01   0      0.84   0.01   0.14   0      0.01
six      0.04   0      0      0.01   0      0      0.92   0.02   0      0.01
seven    0.02   0      0      0.04   0      0.07   0.10   0.69   0.05   0.03
eight    0      0.01   0      0      0      0.03   0.01   0.04   0.77   0.14
nine     0      0      0      0.02   0      0      0      0.02   0.12   0.85

TABLE I

Classifier performance over 100 random trials. Each row corresponds to the classification distribution of a given
digit over the 10 classes. The data set was embedded in 15 dimensions.

The classification error ranges from 0% to 31% with an average of 12.2%. The best classification rate is
achieved  for  the  word  “one”  which,  in  terms  of  visual  information,  stands  far  away  from  the  other  digits. 
In  particular,  typical  sequences  of  “one”  involve  frames  with  a  round  open  mouth,  with  no  teeth  visible 
(see first row of Figure 2). These frames essentially never appear for other digits. The worst classification
rate is for the word “seven”, which is often confused with the words “five” and “six”. As shown
in Figure 2, typical instances of these words appear to be similar in that the central frames involve an
open mouth with visible teeth. In the case of “six” and “seven”, teeth from the lower jaw are visible
because of the “s” sound. Regarding the similarity between “five” and “seven”, the “f” and “v” sounds
translate into the lower lip touching the teeth of the upper jaw.
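
The summary figures quoted for Table I follow from its diagonal. The short check below transcribes the per-digit rates of correct classification and recomputes the error range and average; the variable names are illustrative only.

```python
# Diagonal of Table I: fraction of test instances of each digit classified correctly.
correct = {
    "zero": 0.93, "one": 1.00, "two": 0.88, "three": 0.93, "four": 0.97,
    "five": 0.84, "six": 0.92, "seven": 0.69, "eight": 0.77, "nine": 0.85,
}
errors = {digit: 1.0 - rate for digit, rate in correct.items()}
mean_error = sum(errors.values()) / len(errors)
best = min(errors, key=errors.get)     # digit with the lowest error
worst = max(errors, key=errors.get)    # digit with the highest error
print(f"mean error {100 * mean_error:.1f}%, best {best}, worst {worst}")
```

The diagonal sums to 8.78, so the mean rate of correct classification is 87.8% and the mean error is 12.2%, with “one” at 0% and “seven” at 31%.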


Fig.  2.  Typical  frames  for  the  words  “one”,  “five”,  “six”,  “seven”. 

B.  Synchronization  of  head  movement  data 

We now illustrate the concept of graph alignment as well as the algorithm presented in Section II-D. We
recorded 3 movies of subjects wearing successively a yellow, red and black mask. Each subject was asked
to move their head in front of the camcorder. We then considered the three sets consisting of all frames
of each movie. Let YELLOW, RED and BLACK denote these sets. Our goal was to synchronize the
movements of the different masks by aligning the 3 diffusion embeddings. Note that working
directly in image space would be highly inefficient since any picture of the red or black mask is at a
large distance from the set of pictures of the yellow mask. By contrast, the diffusion coordinates
capture the intrinsic organization of each data set, and therefore provide a canonical representation
of the sets that can be used for matching the data.

Each set of frames was regarded as a collection of points in $\mathbb{R}^{10000}$, where the dimensionality coincides
with the number of pixels per image. Following the lines of our algorithm, we formed a graph from each
set with Gaussian weights $\exp(-\|x_i - x_j\|^2/\varepsilon)$, for an appropriately chosen scale $\varepsilon > 0$. Here $\|x_i - x_j\|$
represents the $L^2$ norm of the difference between images i and j. We expect each set to lie approximately on a manifold of
dimension 2, as each subject essentially moved their head along two angles $\alpha$ and $\beta$ shown in Figure 3,
and as the light conditions were kept the same during the recording.




Fig. 3. Each subject essentially moved their head along the two angles $\alpha$ and $\beta$. There was almost no tilting of the head. Hence, the data
points approximately lie on a submanifold of dimension 2.


It is clear that the density of points on this manifold is essentially arbitrary and varies with each subject
and recording. Since we were only interested in the space of constraints, that is, the geometry of the
manifold, we renormalized the Gaussian weights according to the algorithm described in Section II-B,
and constructed a Markov chain that approximates the Laplace-Beltrami diffusion. We then defined 8
matching triplets of landmarks in each set. The landmarks were chosen to correspond to the main head
positions. We computed the diffusion embedding in 7 dimensions and we then calculated two affine
functions $g_{YR} : \mathbb{R}^7 \to \mathbb{R}^7$ and $g_{YB} : \mathbb{R}^7 \to \mathbb{R}^7$ that match the landmarks from YELLOW to RED, and
from YELLOW to BLACK, respectively.
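
One plausible way to compute such an affine map is an ordinary least-squares fit on the landmark pairs. The sketch below is an assumption about the fitting procedure, not the authors' implementation: the helper `fit_affine` and the synthetic landmark coordinates are hypothetical. With 8 landmarks in 7 dimensions, each output coordinate has 8 unknowns (7 linear coefficients and an offset), so landmarks in general position determine the map exactly.

```python
import numpy as np

def fit_affine(src, dst):
    """Least-squares affine map g(x) = x @ A + b sending rows of src to rows of dst."""
    ones = np.ones((src.shape[0], 1))
    # Augment with a constant column so the offset b is fitted together with A.
    coeffs, *_ = np.linalg.lstsq(np.hstack([src, ones]), dst, rcond=None)
    A, b = coeffs[:-1], coeffs[-1]
    return lambda x: x @ A + b

rng = np.random.default_rng(0)
dim, n_landmarks = 7, 8
yellow = rng.normal(size=(n_landmarks, dim))   # synthetic landmark coordinates (YELLOW)
true_A = rng.normal(size=(dim, dim))
true_b = rng.normal(size=dim)
black = yellow @ true_A + true_b               # synthetic matching landmarks (BLACK)

g_yb = fit_affine(yellow, black)
print(np.allclose(g_yb(yellow), black))
```

Once fitted, the same function can be applied to the diffusion coordinates of any YELLOW frame to find the corresponding location in the BLACK embedding.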

Two conclusions can be drawn from this experiment. First, the diffusion embedding revealed that the
data sets were approximately 2-dimensional, as expected (see Figure 4 for the embeddings in the first 3
diffusion coordinates). The diffusion coordinates captured the main parameters of variability, namely the
angles $\alpha$ and $\beta$. Second, the two functions $g_{YB}$ and $g_{YR}$ allowed us to drive the movements of the black
and red masks from those of the yellow mask. The result of the matching of the three data sets is shown
in Figure 5.


Fig. 4. The embedding of each set in the first 3 diffusion coordinates. The color encodes the density of points.

Fig. 5. The embedding of the YELLOW set in three diffusion coordinates and the various corresponding images after alignment of the
RED and BLACK graphs to YELLOW.




IV.  Conclusion  and  future  work 


In this work we introduced diffusion techniques as a framework for data fusion and multi-cue data
matching by addressing several key issues. First, we underlined the importance of the Laplace-Beltrami
normalization for data fusion by showing that it allows merging data sets produced by the same source
but with different densities. In particular, the Laplace-Beltrami embedding provides a canonical, density-invariant
embedding, which is essential for data matching, for example when matching the visual data of
different speakers or the “rotating heads” sequence. Second, we suggested a new data fusion scheme, by
extending spectral embeddings using the geometric harmonics framework. Finally, we presented a spectral
graph alignment approach to data fusion.

Our scheme was successfully applied to lip reading where we achieved high accuracy with minimal
preprocessing. We also demonstrated the alignment of high dimensional visual data (“rotating heads”
sequence).

In the future we intend to extend our approach to multi-cue data analysis, by integrating different
features in a multigraph, constructed by combining the graphs of the different features over the data set.
Finally, we are studying a spectral based approach to the analysis of signals as Markov random processes.
Our current work did not utilize the temporal information of the video sequences, whose frames were
considered as samples of a random variable. By constructing a Markov process model, we intend to
improve the lip-reading accuracy using the Viterbi algorithm.

Appendix

Diffusion distance and eigenfunctions

The random walk constructed from a graph via the normalized graph Laplacian procedure yields a
Markov matrix P with entries $p_1(x, y)$. As is well known [13], this matrix is in fact conjugate to a
symmetric matrix A with entries $a(x, y)$, given by

$$a(x, y) = \sqrt{\frac{d(x)}{d(y)}}\, p_1(x, y) = \frac{w(x, y)}{\sqrt{d(x) d(y)}}.$$

Therefore A has n eigenvalues $\lambda_0, \dots, \lambda_{n-1}$ and orthonormal eigenvectors $v_0, \dots, v_{n-1}$. In particular,

$$a(x, y) = \sum_{l=0}^{n-1} \lambda_l v_l(x) v_l(y). \qquad (7)$$

This implies that P has the same n eigenvalues. In addition, it has n left eigenvectors $\phi_0, \dots, \phi_{n-1}$ and n
right eigenvectors $\psi_0, \dots, \psi_{n-1}$. Also, it can be checked that

$$\phi_l(y) = v_l(y) v_0(y) \quad \text{and} \quad \psi_l(x) = v_l(x)/v_0(x). \qquad (8)$$

Furthermore, it can be verified that $v_0(x) = \sqrt{d(x)}$, and therefore $\phi_0(y) = d(y)$ and $\psi_0(x) = 1$. In
addition,

$$\phi_0(x) \psi_l(x) = \phi_l(x). \qquad (9)$$

It results from Equations 7 and 8 that $P^t$ admits the following spectral decomposition:

$$p_t(x, y) = \sum_{l=0}^{n-1} \lambda_l^t \psi_l(x) \phi_l(y), \qquad (10)$$

together with the biorthogonality relation

$$\sum_{y \in \Omega} \psi_i(y) \phi_j(y) = \delta_{ij}, \qquad (11)$$

where $\delta_{ij}$ is the Kronecker symbol. Combining this last identity with Equation 9, one obtains

$$\sum_{y \in \Omega} \frac{\phi_i(y) \phi_j(y)}{\phi_0(y)} = \delta_{ij}.$$

This means that the system $\{\phi_l\}$ is orthonormal in $L^2(\Omega, 1/\phi_0)$. Therefore, if one fixes x, Equation 10
can be interpreted as the decomposition of the function $p_t(x, \cdot)$ over this system, where the coefficients of
decomposition are $\{\lambda_l^t \psi_l(x)\}$.

Now by definition,

$$D_t(x, z)^2 = \sum_{y \in \Omega} \frac{(p_t(x, y) - p_t(z, y))^2}{\phi_0(y)} = \|p_t(x, \cdot) - p_t(z, \cdot)\|^2_{L^2(\Omega, 1/\phi_0)}.$$

Therefore,

$$D_t(x, z)^2 = \sum_{l=0}^{n-1} \lambda_l^{2t} (\psi_l(x) - \psi_l(z))^2.$$
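
The final identity can be checked numerically on a small graph. The sketch below is an illustration under stated assumptions rather than part of the original derivation: the random point cloud, the kernel scale 2.0 and the variable names are arbitrary, and $v_0$ is taken with unit norm (as returned by the eigensolver), so that $\phi_0 = v_0^2$ equals the degrees divided by their sum instead of the unnormalized $d(y)$ used in the text.

```python
import numpy as np

rng = np.random.default_rng(0)
points = rng.normal(size=(30, 3))           # arbitrary toy data set
sq_dists = ((points[:, None, :] - points[None, :, :]) ** 2).sum(-1)
W = np.exp(-sq_dists / 2.0)                 # symmetric Gaussian weights, scale 2.0
d = W.sum(axis=1)                           # node degrees d(x)
P = W / d[:, None]                          # Markov matrix p_1(x, y)
A = W / np.sqrt(np.outer(d, d))             # symmetric conjugate a(x, y)

lam, V = np.linalg.eigh(A)                  # orthonormal eigenvectors in columns
order = np.argsort(-lam)                    # sort eigenvalues in decreasing order
lam, V = lam[order], V[:, order]
v0 = V[:, 0]                                # proportional to sqrt(d)
psi = V / v0[:, None]                       # right eigenvectors of P
phi0 = v0 ** 2                              # equals d / sum(d) for unit-norm v0

t, x, z = 3, 0, 1
Pt = np.linalg.matrix_power(P, t)
direct = np.sum((Pt[x] - Pt[z]) ** 2 / phi0)                  # definition of D_t^2
spectral = np.sum(lam ** (2 * t) * (psi[x] - psi[z]) ** 2)    # spectral formula
print(np.isclose(direct, spectral))
```

Both expressions agree up to floating-point error, which is what makes the Euclidean distance between diffusion coordinates a faithful proxy for the diffusion distance.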

Abstract

Data fusion and the analysis of high-dimensional multisensor data are fundamental tasks
in many research communities. In this work we propose a unified embedding scheme for multisensor
data, which is based on the recently introduced diffusion framework. Our scheme is
purely data-driven and assumes no a priori knowledge of the underlying statistical or deterministic
models. Our approach is based on embedding each of the input channels separately
and combining the resulting diffusion coordinates. In particular, we use the density-invariant
Laplace-Beltrami embedding. In order to verify the efficiency of our approach, we apply it
to typical multisensor statistical learning and clustering applications, such as spoken-digit
recognition and multi-cue image segmentation. For both applications we show experimentally
that the unified multisensor embedding allows better performance than that achieved
by any single sensor.




1  Introduction 


The first task performed by any data processing system is data acquisition or sampling, in which
measurements are collected through a number of sensors. In this work, we use the term “sensor” for any
information stream produced by an acquisition device or, more generally, any descriptor used to
represent some form of data. Single-sensor systems, which process data coming from a unique information
channel, have been successfully used in various contexts ranging from object recognition
(e.g. Sonar) to the medical area (e.g. blood pressure sensors). However, it was recognized early
that these systems typically suffer from incompleteness, because a single sensor is almost
never sufficient to capture all of the relevant information related to a phenomenon. For instance,
in medical imaging different sensors, such as X-Ray, CT, MRI and others, assess different physical
properties. This issue was further studied in the context of remote sensing (SAR, FLIR, IR and
optical sensors). In particular, different sensors are subject to different limitations restricting their
usability. For example, in remote sensing, optical sensors have significantly better resolution and
lower SNR than Radar based SAR sensors, yet SAR sensors are immune to atmospheric conditions
and can be used in any weather conditions. The multisensor approach makes it possible to resolve ambiguities
and reduce uncertainties that may arise in some situations, such as object recognition. For example,
consider the work by Kidron et al. [1], who detected image pixels within a video sequence
that were related to the creation of sound, given the visual and audio data. Using only the visual
data was insufficient as some of the motions in a scene were unrelated to the sound creation.

Note also that many living species rely heavily on a multisensor approach (most humans can
see, hear, taste...). In particular, the fusion of audio-visual cues was shown to enhance perception
[2, 3]. Last, it is often more cost-efficient to combine a variety of cheap sensors rather than to deal
with an expensive single sensor.

The use of high-dimensional multisensor signals requires several tasks. First, the signals have
to be embedded in a low-dimensional space that recovers the underlying manifold. When the different
data sources are not synchronized and have to be aligned, this manifold can also be used for
alignment [4]. In particular, as different sources might sample the same phenomenon with different
densities, the alignment requires a density-invariant embedding, as eigenmap representations
[5, 6, 7, 8] depend on the density of the points on the underlying manifold.

A second task is the alignment and synchronization of different multisensor sources. This was
extensively studied in the remote sensing and medical imaging communities. In such applications,
due to the different physical characteristics of various imaging sensors, the relationship between
the intensities of matching pixels is often complex and unknown a priori. The common approach
to multisensor image alignment is to compute canonical representations of image features, which
are invariant to the dissimilarity between the different sensors and capture the essence of the image.
These representations include geometrical primitives such as feature points, contours and
corners [9, 10, 11]. Such approaches apply a deterministic, a priori known model that relates the
measurements of the different input channels.

A  general  purpose  approach  to  high-dimensional  data  embedding  and  alignment  was  presented 
by Ham et al. [12]. Given a set of a priori corresponding points in the different input channels, a
constrained  formulation  of  the  graph  Laplacian  embedding  is  derived.  First,  they  add  a  term  fixing 
the  embedding  coordinates  of  certain  samples  to  predefined  values.  Both  sets  are  then  embedded 
separately,  where  certain  samples  in  each  set  are  mapped  to  the  same  embedding  coordinates. 
Second,  they  describe  a  dual  embedding  scheme,  where  the  constrained  embeddings  of  both  sets 
are  computed  simultaneously,  and  the  embeddings  of  certain  points  in  both  datasets  are  constrained 
to  be  identical. 

Kidron et al. [1] applied canonical correlation analysis to multisensor event detection. Their
approach uses a parametric form of the covariance matrices to compute maximally correlated one-dimensional
embeddings of the audio and video input signals. A sparsity constraint was applied to
regularize the otherwise underconstrained embedding problem, where the constraint corresponds
to the sparsity of the detected events.

There  is  also  a  large  body  of  literature  in  engineering  related  to  multisensor  integration.  These 
approaches  can  be  classified  into  three  categories  [13].  First,  some  techniques  are  based  on  physical 
models  for  the  data,  like  in  the  case  of  Kalman  filtering.  Another  category  corresponds  to  methods 
employing  a  parametric  model  for  the  data  or  the  sensors.  For  instance  this  is  the  case  of  Bayesian 
inference,  of  the  Dempster-Shafer  method  or  Neural  Networks.  These  techniques  usually  exhibit  a 
high  sensitivity  to  the  accuracy  of  these  models  [14].  The  third  group  consists  of  cognitive-based 
methods,  which  aim  at  mimicking  human  inference.  One  of  the  main  tools  is  fuzzy  logic.  But 
there  again,  one  needs  to  specify  subjective  membership  functions.  It  therefore  appears  that  many 
of  these  techniques  rely  on  prior  information. 

A problem related to data fusion is the fusion of multiple partitionings [15]. The focal point
there is to fuse together different partitionings, rather than different data sources as in the general
data fusion problem. This approach boils down to embedding the data in a one-dimensional space
(the partitions index). As this is not a metric space, a distance metric can be defined directly and
the work in [15] uses the co-association matrix as a binary similarity measure.

A related problem was recently studied by the computer vision community in the context of
multi-cue image segmentation. These works are of particular interest, as (similar to our approach)
they are based on spectral embeddings [16]. In [17] Yu presents a segmentation scheme that integrates
edges detected at multiple scales. These are shown to provide complementary segmentation
cues. Given the affinity matrices computed using the edges at each scale, a simultaneous segmentation
is computed using a novel criterion called average cuts. This approach does not explicitly
assume the cues are multiscale and can be applied to different cues rather than a single cue
at different scales. Other works [18, 19] deal with the fusion of a single multiscale cue in images
and can be applied directly to multisensor data.

In this work we derive a unified low-dimensional representation, given a set of different input
channels related to a particular phenomenon. We assume that the input signals are aligned and
derive a unified representation of them, useful for statistical learning tasks and data partitioning.
We compute a unified low-dimensional representation and show that it combines the information
encoded in the different signals, and is thus better able to parameterize complex phenomena. We
start by computing low-dimensional embeddings of each of the input signals using the diffusion
framework [20, 21], and for that we utilize the Laplace-Beltrami density-invariant scheme [22]. The
multisensor scheme is first applied to statistical learning by analyzing audio-visual spoken-digit
recognition, and we compare our result to the results of the visual-only lip-reading given in
[4]. Then we apply it to multi-cue image segmentation, where the multisensor data is related
to different image cues: RGB, contours and texture. Compared to prior works, the presented
approach does not require any deterministic model of the data or its statistics (covariance matrices
etc.), and the structures that it recovers are completely data-driven. In particular, we resolve the
density-dependence issue of the embeddings that was largely overlooked in prior works.

This paper is organized as follows: We describe the foundations of the diffusion based embeddings
and introduce the unified, fused multisensor embedding in Section 2. Our scheme is then
experimentally verified in Section 3, while concluding remarks and future extensions are discussed
in Section 4.


2  Multi-sensor  integration 

In this section we present the proposed data fusion scheme. We start by describing low-dimensional
spectral embeddings and then extend them to derive the density-invariant Laplace-Beltrami embedding.
A more detailed description can be found in [4], while the mathematical foundations are
given in [22]. Given a set $\Omega = \{x_1, \dots, x_n\}$ of data points, we start by constructing a weighted
symmetric graph where each data point $x_i$ corresponds to a node. Two nodes $x_i$ and $x_j$ are connected
by an edge with weight $w(x_i, x_j) = w(x_j, x_i)$ reflecting the degree of similarity (or affinity)
between these two points. The weight function $w(\cdot, \cdot)$ describes the first-order interaction
between the data points and its choice is application-driven. For instance, in applications where
a distance already exists on the data, it is customary to weight the edge between $x_i$ and $x_j$ by
$w(x_i, x_j) = \exp(-d(x_i, x_j)^2/\varepsilon)$, where $\varepsilon > 0$ is a scale parameter. In this paper, although our
method would apply to general weights, we will mainly focus on this type of Gaussian-weight
graph.

Following a classical construction in spectral graph theory [23] and in manifold learning [24],
namely the normalized graph Laplacian, we now create a random walk on the data set $\Omega$ by forming
the kernel

$$p_1(x_i, x_j) = \frac{w(x_i, x_j)}{d(x_i)},$$

where $d(x_i) = \sum_{x_k \in \Omega} w(x_i, x_k)$ is the degree of node $x_i$. As we have that $p_1(x_i, x_j) \geq 0$ and
$\sum_{x_j \in \Omega} p_1(x_i, x_j) = 1$, the quantity $p_1(x_i, x_j)$ can be interpreted as the probability of a random
walker to jump from $x_i$ to $x_j$ in a single time step [23, 25]. If P is the n x n transition matrix of
this Markov chain, then taking powers of this matrix amounts to running the chain forward in time.
Let $p_t(\cdot, \cdot)$ be the kernel corresponding to the t-th power of the matrix P. Then, $p_t(\cdot, \cdot)$ describes
the probabilities of transition in t time steps. The essential point of the diffusion framework is the
idea that running the chain forward will reveal intrinsic geometric structures in the data set, and
taking powers of the matrix P is equivalent to integrating the local geometry of the data at different
scales.

An equivalent way to look at powers of P is to make use of its eigenvectors and eigenvalues:
it can be shown that there exists a sequence $1 = \lambda_0 \geq |\lambda_1| \geq |\lambda_2| \geq \dots$ of eigenvalues and a
collection $\{\psi_0, \psi_1, \psi_2, \dots\}$ of (right) eigenvectors for P:

$$P \psi_l = \lambda_l \psi_l.$$

These eigenvalues and eigenvectors provide embedding coordinates for the set $\Omega$. The data points
can be mapped into a Euclidean space via the embedding

$$\Psi_t : x \longmapsto \left(\lambda_1^t \psi_1(x), \dots, \lambda_{m(t)}^t \psi_{m(t)}(x)\right), \qquad (2.1)$$

where t > 0. Discussions regarding the number m(t) of diffusion coordinates to employ and
concerning the connection with the so-called diffusion distance are provided in [22, 26, ?].
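
As an illustration of the embedding (2.1), the following sketch computes the diffusion coordinates from a row-stochastic matrix built on symmetric weights. It is a minimal example under stated assumptions, not the authors' code: the function name `diffusion_map`, the toy data and the kernel scale are hypothetical, and the eigenvectors are obtained through the symmetric conjugate of P for numerical stability. The exponent t is kept an integer here so that powers of (possibly slightly negative) eigenvalues stay real.

```python
import numpy as np

def diffusion_map(P, d, t, m):
    """First m nontrivial diffusion coordinates lambda_l^t * psi_l(x), l = 1..m.

    P is a row-stochastic matrix built from symmetric weights with degrees d.
    """
    s = np.sqrt(d)
    A = P * s[:, None] / s[None, :]          # symmetric conjugate of P
    lam, V = np.linalg.eigh((A + A.T) / 2)   # symmetrize to remove rounding noise
    order = np.argsort(-np.abs(lam))         # sort by decreasing |eigenvalue|
    lam, V = lam[order], V[:, order]
    psi = V / s[:, None]                     # right eigenvectors of P (up to scaling)
    return psi[:, 1:m + 1] * lam[1:m + 1] ** t, lam, psi

rng = np.random.default_rng(1)
points = rng.normal(size=(40, 3))            # arbitrary toy data set
sq_dists = ((points[:, None, :] - points[None, :, :]) ** 2).sum(-1)
W = np.exp(-sq_dists / 1.0)                  # Gaussian weights with scale 1.0
d = W.sum(axis=1)
P = W / d[:, None]
coords, lam, psi = diffusion_map(P, d, t=2, m=3)
print(coords.shape)
```

The trivial eigenvector $\psi_0$ is constant and is therefore skipped, which matches the indexing of (2.1) starting at $l = 1$.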

Next,  we  address  the  issue  of  obtaining  a  density-invariant  embedding.  The  point  is  to  make  the 
embedding  reflect  only  the  geometry  of  the  data  and  be  insensitive  to  the  distribution  of  the  points. 
Classical  eigenmap  methods  [5,  6,  7,  27],  provide  an  embedding  that  combines  the  information  of 
both  the  density  and  geometry,  and  the  embedding  coordinates  heavily  depend  on  the  density  of  the 
data  points.  In  order  to  remove  the  influence  of  the  distribution  of  the  data  points,  we  renormalize 
the Gaussian edge weights $w_\varepsilon(\cdot, \cdot)$ with an estimate of the density. This is summarized in Algorithm
1, which was first introduced and analyzed in [22].

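Algorithm 1 Approximation of the Laplace-Beltrami diffusion
1: Start with a rotation-invariant kernel $w_\varepsilon(x_i, x_j) = h\left(\|x_i - x_j\|^2/\varepsilon\right)$.

2: Let

$$q_\varepsilon(x_i) = \sum_{j} w_\varepsilon(x_i, x_j),$$

and form the new kernel

$$\tilde{w}_\varepsilon(x_i, x_j) = \frac{w_\varepsilon(x_i, x_j)}{q_\varepsilon(x_i)\, q_\varepsilon(x_j)}.$$

3: Apply the normalized graph Laplacian construction to this kernel, i.e., set

$$d_\varepsilon(x_i) = \sum_{j} \tilde{w}_\varepsilon(x_i, x_j), \qquad (2.2)$$

and define the anisotropic transition kernel

$$p_\varepsilon(x_i, x_j) = \frac{\tilde{w}_\varepsilon(x_i, x_j)}{d_\varepsilon(x_i)}.$$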

Next we describe the data fusion scheme. For the sake of clarity, we restrict our discussion
to the case of two input channels; the scheme extends easily to an arbitrary number of them.
Suppose one has two sets of measurements related to a particular phenomenon $\Omega = \{x_1, \dots, x_n\}$.
Denote these sets of measurements $\Omega_1 = \{y_1^1, \dots, y_n^1\}$ and $\Omega_2 = \{y_1^2, \dots, y_n^2\}$, respectively, where
$y_i^1$ and $y_i^2$ are high-dimensional measurements. We aim to fuse $\Omega_1$ and $\Omega_2$ by computing a unified
low-dimensional representation $\Omega = \{z_1, \dots, z_n\}$. Note that we assume that $\Omega_1$ and $\Omega_2$ are aligned,
meaning that $y_i^1$ and $y_i^2$ relate to the same instance $x_i$. When this assumption is invalid, one has to
apply a multi-sensor alignment scheme [12] prior to applying the fusion procedure.

We start by computing the Laplace-Beltrami embeddings of $\Omega_1$ and $\Omega_2$, denoted $\Phi_1^{m_1} = \{\phi_1^1, \dots, \phi_n^1\}$
and $\Phi_2^{m_2} = \{\phi_1^2, \dots, \phi_n^2\}$, respectively, where $m_k$ is the dimensionality of the k-th embedding. Each
representation reflects the geometry of the data as viewed by each sensor individually. In order to
combine these analyses into a unified representation $\Omega$, we form $\Omega = \{z_1, \dots, z_n\}$ where

$$z_i = \{\phi_i^1, \phi_i^2\}, \qquad (2.3)$$

$\phi_i^1$ and $\phi_i^2$ being of dimensions $m_1$ and $m_2$, respectively. In general, given K input sources we have

$$z_i = \{\phi_i^1, \dots, \phi_i^K\}. \qquad (2.4)$$

This boils down to combining the embedding coordinates corresponding to each sample $x_i$ over
the different input channels $\{\Omega_k\}$.

In essence, our scheme is the embedding analogue of boosting [28], where instead of adaptively
integrating the output of several classifiers, we combine different embeddings. In particular, one
can consider an equivalent to the AdaBoost scheme [28] for semi-supervised classification, where
Eq. 2.4 is replaced with

$$z_i = \{a^1 \phi_i^1, \dots, a^K \phi_i^K\}, \qquad (2.5)$$

$\{a^1, \dots, a^K\}$ being the weights per embedding. In that sense, the embeddings $z_i$ can be considered
as different features, and one can apply a standard implementation of AdaBoost to Eq. 2.4. Yet,
in this work, the focal point is to derive general-purpose coordinates regardless of a particular
application. The scheme is summarized in Algorithm 2.

Algorithm  2  Multisensor  embedding 
1:  Start with $K$ input sources $\Omega_k = \{y_1^k, \ldots, y_n^k\}$, $k = 1 \ldots K$.

2:  Compute the Laplace-Beltrami embeddings of $\{\Omega_k\}$, denoted $\Phi_k^{m_k}$, where $m_k$ is the
dimensionality of the embedding of the $k$-th channel.

3:  Compute the unified coordinates set $\Omega = \{z_1, \ldots, z_n\}$ by appending the embeddings of each
input sensor

$z_i = \{\phi_i^1, \ldots, \phi_i^K\}$,  $i = 1 \ldots n$,  $k = 1 \ldots K$.
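
The appending step of Algorithm 2 can be sketched in a few lines. The following is a minimal illustration rather than the implementation used for the experiments: the function name `fuse_embeddings`, the nested-list data layout and the toy coordinates are all assumptions made for this example. With the optional weights left unset it corresponds to Eq. 2.4; passing one scalar per channel gives the weighted form of Eq. 2.5.

```python
from typing import List, Optional, Sequence

Vector = List[float]


def fuse_embeddings(
    embeddings: Sequence[Sequence[Sequence[float]]],
    weights: Optional[Sequence[float]] = None,
) -> List[Vector]:
    """Concatenate per-channel embedding coordinates sample by sample.

    embeddings[k][i] holds the m_k coordinates of sample i in channel k.
    weights[k] is an optional scalar a_k for channel k (Eq. 2.5); when
    weights is None every a_k is 1, which reduces to Eq. 2.4.
    """
    if not embeddings:
        raise ValueError("at least one channel is required")
    n = len(embeddings[0])
    if any(len(channel) != n for channel in embeddings):
        raise ValueError("channels must be aligned: same number of samples")
    if weights is None:
        weights = [1.0] * len(embeddings)
    if len(weights) != len(embeddings):
        raise ValueError("one weight per channel is required")

    fused: List[Vector] = []
    for i in range(n):
        z_i: Vector = []
        for a_k, channel in zip(weights, embeddings):
            # Append the (optionally scaled) coordinates of channel k.
            z_i.extend(a_k * coord for coord in channel[i])
        fused.append(z_i)
    return fused


# Toy data (hypothetical): two aligned samples, m_1 = 2 and m_2 = 3.
audio = [[0.1, 0.2], [0.3, 0.4]]
video = [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]]
fused = fuse_embeddings([audio, video])
weighted = fuse_embeddings([audio, video], weights=[2.0, 0.5])
```

Because the channels are only concatenated, the per-channel embeddings never need to be recomputed when a different combination of channels or weights is tried.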


3  Experimental  results 

The  proposed  scheme  was  experimentally  verified  by  applying  it  to  two  tasks.  First,  we  extend 
our  former  results  in  visual-only  lip-reading  [4]  to  audio-visual  data.  The  audio-visual  inputs  are 
integrated using the multisensor fusion scheme given in Section 2 and used for spoken-digit
recognition. We show that the fused multi-sensor representation provides better recognition. Second,
we integrate several image cues (texture, RGB values, contours, etc.) and show that using them in
conjunction improves the segmentation results.

3.1  Spoken-digit  recognition 

We start by providing a short description of the experimental setup. We follow the statistical
learning scheme used in [4], where the classifier was constructed in two steps. First we parametrized
the embedding manifold using a large number of unlabeled samples. The embedding is then
extended, using the Geometric Harmonics [4, 29], to a small set of labeled examples to create a set of
“signatures” in the embedding coordinates. Then, given a test sample, we embed it by extending
the manifold embedding, and find the nearest signature in the embedding space.

To this end, we recorded several grayscale movies depicting the lips of a subject reading a text
in English and retained both the video sequence and the audio track. Each video frame was cropped
to a rectangle of size 140 x 110 around the lips and was viewed as a point in $\mathbb{R}^{140 \times 110}$. As far
as the audio data was concerned, the sound signal was broken up into overlapping time-windows
centered at the beginning of each video frame. The sampling rate of the video being 25 frames per
second, we chose to form windows of duration equal to 80 ms, that is, equal to the duration of two
video frames. In order to reduce the influence of this splitting, each piece of signal was multiplied
by a bell-shaped function and we then computed the DCT of the result. Last, we took the
logarithm of the magnitude of the DCT coefficients as the audio features. The audio and video data
sets therefore contained the same number of points.
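
The audio feature extraction can be sketched as follows. This is an illustration under stated assumptions, not the code used in the experiments: the text does not specify the bell-shaped function, so a Hann window is assumed here; the unnormalized DCT-II, the zero padding at the signal boundaries, the small `floor` added before the logarithm, and the toy 1 kHz sampling rate are likewise assumptions.

```python
import math
from typing import List, Sequence


def hann_window(length: int) -> List[float]:
    """Bell-shaped taper. The Hann window is an assumed choice."""
    if length < 2:
        return [1.0] * length
    return [0.5 - 0.5 * math.cos(2.0 * math.pi * n / (length - 1))
            for n in range(length)]


def dct_ii(x: Sequence[float]) -> List[float]:
    """Unnormalized DCT-II, evaluated directly in O(N^2) operations."""
    size = len(x)
    return [
        sum(x[n] * math.cos(math.pi / size * (n + 0.5) * k) for n in range(size))
        for k in range(size)
    ]


def audio_features(
    signal: Sequence[float],
    sample_rate: float,
    num_frames: int,
    frame_rate: float = 25.0,
    window_frames: float = 2.0,
    floor: float = 1e-12,
) -> List[List[float]]:
    """One log-magnitude DCT vector per video frame.

    Each window lasts `window_frames` video frames and is centered on the
    start time of its video frame, so consecutive windows overlap.
    """
    window_len = int(round(window_frames * sample_rate / frame_rate))
    taper = hann_window(window_len)
    features = []
    for frame in range(num_frames):
        center = int(round(frame * sample_rate / frame_rate))
        start = center - window_len // 2
        # Samples outside the recording are treated as silence.
        segment = [signal[t] if 0 <= t < len(signal) else 0.0
                   for t in range(start, start + window_len)]
        tapered = [s * w for s, w in zip(segment, taper)]
        features.append([math.log(abs(c) + floor) for c in dct_ii(tapered)])
    return features


# Toy signal (hypothetical): 0.2 s of a 100 Hz tone sampled at 1 kHz,
# which spans 5 video frames at 25 frames per second.
sample_rate = 1000.0
signal = [math.sin(2.0 * math.pi * 100.0 * t / sample_rate) for t in range(200)]
features = audio_features(signal, sample_rate, num_frames=5)
```

With these settings every window holds 80 samples (two video frames), and the function returns exactly one feature vector per video frame, which is what keeps the audio and video data sets aligned.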

The first data set consisted of 6000 video frames (and as many audio windows), corresponding
to the speaker reading a press article. We will refer to this data as “text data”. Next, we asked the
subject to repeat each digit “zero”, “one”, ..., “nine” 40 times. This was used to construct a small
vocabulary of words later employed for training and testing a simple classifier. To each spoken
digit corresponded a sequence of frames in the video data, and a sequence of time-windows for the
audio data. We will refer to this data as “digit data”.

We proceeded as follows for each channel: first, the data points corresponding to the text data
were used to learn the geometry of speech data as we formed a graph with Gaussian weights
$\exp(-\|x_i - x_j\|^2 / \varepsilon)$ on the edges, for an appropriately chosen scale $\varepsilon > 0$. We then renormalized the
Gaussian weights using the Laplace-Beltrami normalization described in Algorithm 1. In order
to obtain a low-dimensional parametrization we computed the diffusion coordinates on this new
graph. Therefore we ended up with two embeddings, corresponding to either the audio or visual
data.

The next step involved the digit data. We computed the diffusion coordinates for all of the
samples in the digit data, by applying the Geometric Harmonics scheme [4, 29] and extending the
diffusion coordinates computed on the text data.

Figure 1: The visual data in the first 3 diffusion coordinates. We also show a trajectory
corresponding to an instance of the word “one”.

In order to train a classifier for digit identification, we randomly selected 20 sequences of each
digit, the remaining sequences being used as a test set. Each digit word can now be viewed as
a trajectory in the diffusion space and the word recognition problem now amounts to identifying
trajectories in the diffusion space (see Fig. 1). We can now build a classifier based on comparing
a new trajectory to a collection of labeled trajectories in the training set. In order to compare
trajectories in the diffusion space we used the symmetric Hausdorff distance between two sets $\Gamma_1$
and $\Gamma_2$, defined as


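$d_H(\Gamma_1, \Gamma_2) = \max \left\{ \max_{x_1 \in \Gamma_1} \min_{x_2 \in \Gamma_2} \|x_1 - x_2\|, \; \max_{x_2 \in \Gamma_2} \min_{x_1 \in \Gamma_1} \|x_1 - x_2\| \right\}$.  (3.1)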
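
As a sketch of how Eq. 3.1 can drive a nearest-trajectory classifier, consider the following. The function names, the tuple-based trajectory representation and the two-dimensional toy trajectories are assumptions made for illustration; in the experiments the trajectories live in the diffusion space.

```python
import math
from typing import Sequence, Tuple

Point = Sequence[float]
Trajectory = Sequence[Point]


def directed_hausdorff(a: Trajectory, b: Trajectory) -> float:
    """Largest distance from a point of `a` to its nearest point in `b`."""
    return max(min(math.dist(p, q) for q in b) for p in a)


def hausdorff(a: Trajectory, b: Trajectory) -> float:
    """Symmetric Hausdorff distance of Eq. 3.1."""
    return max(directed_hausdorff(a, b), directed_hausdorff(b, a))


def classify(trajectory: Trajectory,
             training: Sequence[Tuple[str, Trajectory]]) -> str:
    """Return the label of the closest labeled trajectory."""
    label, _ = min(training, key=lambda item: hausdorff(trajectory, item[1]))
    return label


# Toy trajectories (hypothetical) in a 2-D embedding space.
one = [(0.0, 0.0), (1.0, 0.0), (2.0, 0.0)]
two = [(0.0, 5.0), (1.0, 5.0)]
query = [(0.0, 0.5), (2.0, 0.5)]
predicted = classify(query, [("one", one), ("two", two)])
```

Trajectories may contain different numbers of points, since the distance only compares each point with its nearest neighbor in the other set.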


Results of this classifier for the visual data only were already presented in [4], where 15
eigenvectors were used for the embedding. For the sake of completeness, we re-ran this experiment with
10 eigenvectors. The results are shown in Table 1. Concerning the audio data, the classification
performance for a classifier using 10 eigenvectors is presented in Table 2.

Table 1: Visual-only classifier performance, averaged over 50 random trials and using 10
diffusion coordinates. Each row corresponds to the classification distribution of a given digit over
the 10 classes. The data set was embedded in 15 dimensions.

Finally, in order to illustrate the superiority of combining both data channels using our
multisensor integration scheme, we present the results obtained when using Algorithm 2 (see Table 3).
More  precisely,  we  appended  the  first  5  eigenvectors  of  the  audio  data  with  the  top  5  eigenvectors 
of  the  video  data,  and  then  computed  a  new  graph  from  this  new  feature  representation  of  the  data. 
Finally  a  classifier  was  trained  and  tested  on  an  embedding  using  10  eigenvectors  of  the  diffusion 
defined  on  this  new  graph  (see  Fig.  ??). 

A summary of all performances is also shown in Table 4. Overall, our scheme combining
both channels outperforms the classifiers using only one channel. More precisely, it seems to
get the best of the predictive powers of the audio and visual classifiers. In fact, this is a direct
consequence of the concatenation of the audio and visual diffusion features. For instance, “one” is
very successfully classified using the visual channel. As suggested in [4], typical frame sequences
corresponding to the word “one” contain pictures with an open mouth and no teeth appearing.
This type of frame almost never appears in other digit sequences. As a consequence, trajectories
for the word “one” will be well separated from other digit trajectories in the visual diffusion space.
As far as audio is concerned, the separation is less pronounced and there is some amount of
confusion with “five” and “nine”. When appending both the audio and visual representations,
the separation remains high.

Table 2: Audio-only classifier performance, averaged over 50 random trials and using 10
diffusion coordinates. Each row corresponds to the classification distribution of a given digit over
the 10 classes. The data set was embedded in 15 dimensions.

Notice also that these good results were obtained despite the fact that we used only 5
eigenvectors from each channel in the combined scheme, whereas 10 eigenvectors were used for either of the
single-channel schemes.


          “0”    “1”    “2”    “3”    “4”    “5”    “6”    “7”    “8”    “9”
zero     0.90   0.00   0.00   0.00   0.00   0.00   0.06   0.04   0.00   0.00
one      0.00   0.99   0.00   0.00   0.00   0.00   0.00   0.00   0.00   0.01
two      0.00   0.00   0.96   0.01   0.02   0.00   0.00   0.00   0.00   0.00
three    0.00   0.00   0.00   0.99   0.00   0.00   0.00   0.00   0.00   0.01
four     0.00   0.00   0.00   0.04   0.96   0.00   0.00   0.00   0.00   0.00
five     0.00   0.00   0.00   0.00   0.00   0.97   0.00   0.00   0.02   0.01
six      0.06   0.00   0.00   0.00   0.00   0.00   0.90   0.04   0.00   0.00
seven    0.03   0.00   0.00   0.00   0.00   0.00   0.03   0.93   0.00   0.00
eight    0.00   0.00   0.00   0.00   0.00   0.01   0.00   0.01   0.95   0.03
nine     0.00   0.01   0.00   0.00   0.00   0.01   0.00   0.00   0.02   0.96

Table 3: Combined: Classification results for the scheme combining both channels, over 50
random trials. The combined graph was built from a feature representation of the data based on
appending the first 5 eigenvectors of the audio channel with the first 5 eigenvectors of the video
stream. From this graph, we computed 10 eigenvectors, and we used them for representing the
data.

Channel type   “0”    “1”    “2”    “3”    “4”    “5”    “6”    “7”    “8”    “9”
Audio          0.75   0.94   0.87   0.90   0.96   0.86   0.93   0.81   0.80   0.92
Visual         0.90   0.99   0.90   0.94   0.93   0.81   0.87   0.74   0.75   0.82
Combined       0.90   0.99   0.96   0.99   0.96   0.97   0.90   0.93   0.95   0.96

Table 4: Summary


3.2  Image segmentation

The sensor fusion scheme was also applied to multi-cue image segmentation. As features we used
combinations of Interleaving Contours (IC) [30], the $L_2$ metric between RGB values and Gabor
filter based texture descriptors [31]. The Gabor filters used 3 scales and 8 orientations. For each
pixel $P$, the metrics were computed in an area of 5 x 5 pixels around $P$, such that, given a feature
$f(i,j)$ computed over the image $I$, the distance between the pixels $P(i,j)$ and $P_1(i_1,j_1)$ with
respect to the feature is given by

$D(P, P_1) = \begin{cases} \|f(i,j) - f(i_1,j_1)\|_{L_2}, & \sqrt{(i - i_1)^2 + (j - j_1)^2} < 2 \\ \infty, & \text{else} \end{cases}$  (3.2)


Equation 3.2 is used to sparsify the affinity matrix of $I$; otherwise, computing its eigenvectors
becomes computationally prohibitive for common image sizes. Applying Eq. 3.2 might create
spurious additional parameterizations related to the spatial coordinates. For instance, consider
the vertical and horizontal lines in all of the segmentations in Fig. 2. We refrained from using
the Nyström method [32], which would have resolved this issue, in order to simplify the testing
procedure and because this phenomenon is well understood.
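
The sparsification induced by Eq. 3.2 can be sketched as follows. The names `feature_distance` and `sparse_distances`, the dictionary-of-pairs storage and the toy 3 x 3 feature image are assumptions made for illustration. Since the spatial cutoff is a strict radius of 2, only the 3 x 3 neighborhood of each pixel (including the pixel itself) yields a finite distance, so only those pairs need to be stored.

```python
import math
from typing import Dict, List, Sequence, Tuple

Pixel = Tuple[int, int]
FeatureImage = List[List[Sequence[float]]]


def feature_distance(f: FeatureImage, p: Pixel, p1: Pixel,
                     radius: float = 2.0) -> float:
    """Eq. 3.2: L2 feature distance for nearby pixels, infinity otherwise."""
    (i, j), (i1, j1) = p, p1
    if math.hypot(i - i1, j - j1) >= radius:
        return math.inf
    return math.dist(f[i][j], f[i1][j1])


def sparse_distances(f: FeatureImage) -> Dict[Tuple[Pixel, Pixel], float]:
    """Store only the finite entries of the pairwise distance matrix."""
    rows, cols = len(f), len(f[0])
    finite: Dict[Tuple[Pixel, Pixel], float] = {}
    for i in range(rows):
        for j in range(cols):
            # Offsets in {-1, 0, 1} are exactly the pixels within radius < 2.
            for di in (-1, 0, 1):
                for dj in (-1, 0, 1):
                    i1, j1 = i + di, j + dj
                    if 0 <= i1 < rows and 0 <= j1 < cols:
                        finite[((i, j), (i1, j1))] = feature_distance(
                            f, (i, j), (i1, j1))
    return finite


# Toy 3 x 3 feature image (hypothetical): each pixel carries a 3-vector.
image = [[(float(i), float(j), 0.0) for j in range(3)] for i in range(3)]
distances = sparse_distances(image)
```

Each pixel contributes at most 9 finite entries, so the number of stored distances grows linearly with the number of pixels rather than quadratically.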

For every input image, we computed several embeddings and the integrated representation was
computed by the procedure described in Section 2. We emphasize that for each image, the same
embedding vectors were used both for the single and multi-cue segmentations. In all of the
simulations we used 5 eigenvectors from each feature. For all images we present the segmentation results
of applying k-means clustering to each of the original embeddings and the fused coordinates. This
follows the Modified-NCut (MNCut) image segmentation scheme [33]. The scheme was
implemented in Matlab and used the built-in kmeans and SVD implementations. Note that the regular
Graph-Laplacian was used for the segmentation and not the density-invariant Laplace-Beltrami.

Figure 2 depicts the segmentation results of the Tiger image taken from the Berkeley
segmentation database. The image was segmented using the IC and RGB features and the results are
shown in Figs. 2a and 2b, respectively. The segmentation results in Fig. 2c show that using the
fused coordinates provided better results than either the IC or the RGB segmentation results.

Figure 2: Applying the proposed scheme to the Tiger image. (a) Segmentation achieved using
the Interleaving Contours edge-based features. (b) Segmentation results based on $L_2$ differences in
RGB values. (c) Using the fused coordinates we achieve a visually more pleasing result.

Different features were used in Fig. 3. The IC feature is ineffective for analyzing highly textured
images, as it creates over-segmentation. Thus, we used the RGB and texture features. The
texture-based segmentation (Fig. 3a) results in over-segmentation in the lizard’s body, while missing the
cut between the front and background rocks on the left side of the image. Similarly, using the RGB
descriptor also results in over-segmentation. In contrast, the combined segmentation is more
pleasing to the eye and is able to detect salient multi-cue edges in the image.

Figure 3: Applying the proposed scheme to the Lizard image. (a) Segmentation achieved using the
texture features. Note the over-segmentation in the area behind the Lizard’s head. (b) Segmentation
results based on $L_2$ differences in RGB values. Note the over-segmentation above the Lizard’s leg.
(c) Using the fused coordinates we achieve a visually more pleasing result.


Figure 4: Multisensor embedding applied to multiscale image segmentation. The Interleaving
Contours edge-based feature was applied to each of the images in the first row ((a), (b), (c)). The
second row depicts the corresponding segmentation results at Scale #1 (d), Scale #2 (e) and
Scale #3 (f). (g) shows the improved segmentation achieved using the fused coordinates.

Finally, we applied the fusion scheme to multi-scale image segmentation. The image was
smoothed by a Gaussian kernel and three resolution scales (shown in Figs. 4a, 4b and 4c) were
created. The IC feature was computed based on each image and the embeddings were fused. We
see that using the proposed scheme resulted in a segmentation that combined the mutual cluster
boundaries in the image, allowing it to overlook some of the spurious segmentations, such as the
left eye in Fig. 4a and the throat area in Fig. 4b. In [17] the multiscale segmentation was computed
via a computation of an “average cut”. There, the Markov matrices that were computed at each
scale were used, rather than the embedding vectors. In practice, there is no difference between the
multiscale fusion and the fusion of the other descriptors depicted in Figs. 2 and 3. In particular,
one can combine different features and scales directly.

To conclude, by fusing the different features, we were able to achieve better segmentation
results. In essence, this approach resembles biological vision systems in combining different cues
and emphasizing salient multi-feature edges. The scheme is flexible: once the embeddings of
each feature are computed, one can combine the embeddings in any possible way without having
to recompute them.

4  Conclusions  and  future  work 

In this work we presented a unified multisensor data embedding scheme, based on the diffusion
framework. The fusion was achieved by combining the embeddings of different input channels. We
applied the scheme to audio-visual lip reading and image segmentation, which are typical examples
of multisensor pattern recognition and classification. In both cases, the results achieved by using
fused coordinates were superior to those of the single sensor.

We embedded each data source separately and then appended the embeddings to produce the
fused representation. Although this approach is straightforward and allows different channels to be
combined easily, it is possible that different channels are correlated. In that case, one can find a lower
dimensional representation by considering the unified coordinates as the features of a signal and
re-embedding them to further reduce the dimensionality.

The image segmentation results suggest that in certain applications, one can utilize a variety of
features at different resolution scales. Thus, due to the large number of possible input channels, it
might be beneficial to compute adaptive weights that maximize a certain criterion. For instance, in
semi-supervised classification problems, one can train the weights of the combined representation
for optimal classification over a training set by using the AdaBoost algorithm.


GEOMETRIC DIFFUSIONS AS A TOOL FOR
HARMONIC ANALYSIS AND STRUCTURE
DEFINITION OF DATA

PART  I:  DIFFUSION  MAPS 

R.  R.  COIFMAN,  S.  LAFON,  A.  B.  LEE,  M.  MAGGIONI,  B. 
NADLER,  F.  WARNER,  AND  S.W.  ZUCKER 

Abstract. We provide a framework for structural multiscale geometric organization of graphs
and subsets of $\mathbb{R}^n$. We use diffusion semigroups to generate multiscale geometries in order to
organize and represent complex structures. We show that appropriately selected eigenfunctions
or scaling functions of Markov matrices, which describe local transitions, lead to macroscopic
descriptions at different scales. The process of iterating or diffusing the Markov matrix is seen
as a generalization of some aspects of the Newtonian paradigm, in which local infinitesimal
transitions of a system lead to global macroscopic descriptions by integration. In Part I below,
we provide a unified view of ideas from data analysis, machine learning and numerical analysis.
In the second part of this paper, we augment this approach by introducing fast order-$N$
algorithms for homogenization of heterogeneous structures as well as for data representation.


1.  Introduction 

The geometric organization of graphs and data sets in $\mathbb{R}^n$ is a central problem in statistical
data analysis. In the continuous Euclidean setting, tools from harmonic analysis, such as Fourier
decompositions, wavelets and spectral analysis of pseudo-differential operators have proven
highly successful in many areas such as compression, denoising and density estimation [1, 2].
In this paper, we extend multiscale harmonic analysis to discrete graphs and subsets of $\mathbb{R}^n$.
We use diffusion semigroups to define and generate multiscale geometries of complex structures.
This framework generalizes some aspects of the Newtonian paradigm, in which local infinitesimal
transitions of a system lead to global macroscopic descriptions by integration, the global
functions being characterized by differential equations. We show that appropriately selected
eigenfunctions of Markov matrices (describing local transitions, or affinities in the system) lead
to macroscopic representations at different scales. In particular, the top eigenfunctions permit a
low-dimensional geometric embedding of the set into $\mathbb{R}^k$, with $k \ll n$, so that the ordinary
Euclidean distance in the embedding space measures intrinsic diffusion metrics on the data.
Many of these ideas appear in a variety of contexts of data analysis, such as spectral graph
theory, manifold learning, nonlinear principal components and kernel methods. We augment
these approaches by showing that the diffusion distance is a key intrinsic geometric quantity
linking spectral theory of the Markov process, Laplace operators, or kernels, to the corresponding
geometry and density of the data. This opens the door to the application of methods from
numerical analysis and signal processing to the analysis of functions and transformations of
the data.


2.  Diffusion Maps

The problem of finding meaningful structures and geometric descriptions of a data set $X$ is
often tied to that of dimensionality reduction. Among the different techniques developed,
particular attention has been paid to kernel methods [3]. Their nonlinearity as well as their
locality-preserving property are generally viewed as a major advantage over classical methods
like Principal Component Analysis and classical Multidimensional Scaling. Several other methods
to achieve dimensional reduction have also emerged from the field of manifold learning, e.g.
Local Linear Embedding [4], Laplacian eigenmaps [5], Hessian eigenmaps [6], Local Tangent
Space Alignment [7]. All these techniques minimize a quadratic distortion measure of the desired
coordinates on the data, naturally leading to the eigenfunctions of Laplace-type operators as
minimizers. We extend the scope of application of these ideas to various tasks, such as regression
of empirical functions, by adjusting the infinitesimal descriptions, and the description of the
long-time asymptotics of stochastic dynamical systems.

The simplest way to introduce our approach is to consider a set $X$ of normalized data points.
Define the “quantized” correlation matrix $C = \{c_{ij}\}$, where $c_{ij} = 1$ if $(x_i \cdot x_j) > 0.95$,
and $c_{ij} = 0$ otherwise. We view this matrix as the adjacency matrix of a graph on which we
define an appropriate Markov process to start our analysis. A more continuous kernel version
can be defined as

$c_{ij} = e^{-\frac{1 - (x_i \cdot x_j)}{\varepsilon}} = e^{-\frac{\|x_i - x_j\|^2}{2\varepsilon}}$.

The remarkable fact is that the eigenvectors of this “corrected correlation” can be much more
meaningful in the analysis of data than the usual principal components as they relate to
diffusion and inference on the data.

As an illustration of the geometric approach, suppose that the data points are uniformly
distributed on a manifold $X$. Then it is known from spectral graph theory [8] that if
$W = \{w_{ij}\}$ is any symmetric positive semi-definite matrix, with non-negative entries, then
the minimization of

$Q(f) = \sum_i \sum_j w_{ij} (f_i - f_j)^2$,

where $f$ is a function on the data set $X$ with the additional constraint of unit norm, is
equivalent to finding the eigenvectors of $D^{-1/2} W D^{-1/2}$, where $D = \{d_{ij}\}$ is a diagonal
matrix with diagonal entry $d_{ii}$ equal to the sum of the elements of $W$ along the $i$th row.
Belkin et al. [5] suggest the choice $w_{ij} = e^{-\frac{\|x_i - x_j\|^2}{\varepsilon}}$, in which case the distortion $Q$
clearly penalizes pairs of points that are very close, forcing them to be mapped to very close
values by $f$. Likewise, pairs of points that are far away from each other play no role in this
minimization. The first few eigenfunctions $\{\phi_k\}$ are then used to map the data in a nonlinear
way so that the closeness of points is preserved. We will provide a principled geometric
approach for the selection of eigenfunction coordinates.

This general framework based upon diffusion processes leads to efficient multiscale analysis of
data sets for which we have a Heisenberg localization principle relating localization in data to
localization in spectrum. We also show that spectral properties can be employed to embed the
data into a Euclidean space via a diffusion map. In this space, the data points are reorganized
in such a way that the Euclidean distance corresponds to a diffusion metric. The case of
submanifolds of $\mathbb{R}^n$ is studied in greater detail and we show how to define different kinds of
diffusions in order to recover the intrinsic geometric structure, separating geometry from
statistics. More details on the topics covered in this section can be found in [9]. We also propose
an additional diffusion map based on a specific anisotropic kernel whose eigenfunctions capture
the long-time asymptotics of data sampled from a stochastic dynamical system [10].


2.1.  Construction of the diffusion map.  From the above discussion, the data points can be
thought of as being the nodes of a graph whose weight function $k(x,y)$ (also referred to as
“kernel” or “affinity function”) satisfies the following properties:

•  $k$ is symmetric: $k(x,y) = k(y,x)$,

•  $k$ is positivity preserving: for all $x$ and $y$ in $X$, $k(x,y) \geq 0$,

•  $k$ is positive semi-definite: for all real-valued bounded functions $f$ defined on $X$,

$\int_X \int_X k(x,y) f(x) f(y) \, d\mu(x) \, d\mu(y) \geq 0$,

where $\mu$ is a probability measure on $X$.

The construction of a diffusion process on the graph is a classical topic in spectral graph
theory (weighted graph Laplacian normalization, see [8]), and the procedure consists in
renormalizing the kernel $k(x,y)$ as follows: for all $x \in X$, let

$v^2(x) = \int_X k(x,y) \, d\mu(y)$

and set

$a(x,y) = \frac{k(x,y)}{v^2(x)}$.


Notice that we have the following conservation property:

(2.1)  $\int_X a(x,y) \, d\mu(y) = 1$,

therefore, the quantity $a(x,y)$ can be viewed as the probability for a random walker on $X$ to
make a step from $x$ to $y$. Now we naturally define the diffusion operator

$Af(x) = \int_X a(x,y) f(y) \, d\mu(y)$.
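
On a finite sample the renormalization can be sketched as below. This is an illustration under assumptions, not part of the original construction: the Gaussian kernel, the scale `eps`, the four toy points and all function names are chosen for the example, and sums over the sample stand in for integrals against the empirical measure (the constant factor $1/n$ cancels in the ratio).

```python
import math
from typing import List, Sequence, Tuple

Matrix = List[List[float]]


def gaussian_kernel(points: Sequence[Sequence[float]], eps: float) -> Matrix:
    """Symmetric, non-negative kernel k(x, y) = exp(-|x - y|^2 / eps)."""
    return [[math.exp(-math.dist(x, y) ** 2 / eps) for y in points]
            for x in points]


def markov_normalize(k: Matrix) -> Tuple[Matrix, List[float]]:
    """Return a(x, y) = k(x, y) / v^2(x) together with v^2(x) = sum_y k(x, y)."""
    v2 = [sum(row) for row in k]
    a = [[k_xy / v2_x for k_xy in row] for row, v2_x in zip(k, v2)]
    return a, v2


def apply_operator(a: Matrix, f: Sequence[float]) -> List[float]:
    """Discrete diffusion operator: (Af)(x) = sum_y a(x, y) f(y)."""
    return [sum(a_xy * f_y for a_xy, f_y in zip(row, f)) for row in a]


# Toy sample (hypothetical): three nearby points and one distant point.
points = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (5.0, 5.0)]
k = gaussian_kernel(points, eps=2.0)
a, v2 = markov_normalize(k)
row_sums = [sum(row) for row in a]
constants = apply_operator(a, [1.0] * len(points))
```

Each row of `a` sums to one, which is the discrete counterpart of (2.1), and constant functions are left unchanged by the operator, as expected of an averaging (Markov) operator.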

As is well known in spectral graph theory [8], there is a spectral theory for this Markov chain,
and if $\tilde{A}$ is the integral operator defined on $L^2(X)$ with the kernel

(2.2)  $\tilde{a}(x,y) = \frac{v(x)}{v(y)} a(x,y)$,

then it can be verified that $\tilde{A}$ is a symmetric operator. Consequently, we have the following
spectral decomposition

(2.3)  $\tilde{a}(x,y) = \sum_{i \geq 0} \lambda_i^2 \phi_i(x) \phi_i(y)$,

where $\lambda_0 = 1 \geq \lambda_1 \geq \lambda_2 \geq \ldots$. Let $\tilde{a}^{(m)}(x,y)$ be the kernel of $\tilde{A}^m$. Then we have

(2.4)  $\tilde{a}^{(m)}(x,y) = \sum_{i \geq 0} \lambda_i^{2m} \phi_i(x) \phi_i(y)$.


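Last, we introduce the family of diffusion maps $\{\Phi_m\}$ by

$\Phi_m(x) = \begin{pmatrix} \lambda_0^m \phi_0(x) \\ \lambda_1^m \phi_1(x) \\ \vdots \end{pmatrix}$

and the family of diffusion distances $\{D_m\}$ defined by

$D_m^2(x,y) = \tilde{a}^{(m)}(x,x) + \tilde{a}^{(m)}(y,y) - 2\tilde{a}^{(m)}(x,y)$.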


The quantity $a(x,y)$, which is related to $\tilde{a}(x,y)$ according to equation (2.2), can be
interpreted as the transition probability of a diffusion process, while $a^{(m)}(x,y)$ represents the
probability of transition from $x$ to $y$ in $m$ steps. To this diffusion process corresponds the
distance $D_m(x,y)$, which defines a metric on the data that measures the rate of connectivity
of the points $x$ and $y$ by paths of length $m$ in the data, and in particular, it is small if there
are a large number of paths connecting $x$ and $y$. Note that, unlike the geodesic distance,
this metric is robust to perturbations on the data.

The dual point of view is that of the analysis of functions defined on the data. The kernel
$\tilde{a}^{(m)}(x, \cdot)$ can be viewed as a bump function centered at $x$, that becomes wider as $m$
increases. The distance $D_{2m}(x,y)$ is also a distance between the two bumps $\tilde{a}^{(m)}(x, \cdot)$ and
$\tilde{a}^{(m)}(y, \cdot)$:

$D_{2m}^2(x,y) = \int_X |\tilde{a}^{(m)}(x,z) - \tilde{a}^{(m)}(y,z)|^2 \, dz$.
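
The bump identity can be checked numerically on a small sample. The sketch below is an illustration under assumptions: the Gaussian kernel, the scale, the four toy points and the function names are chosen for the example, and matrix powers of the discrete symmetric kernel (with sums replacing integrals) play the role of $\tilde{a}^{(m)}$.

```python
import math
from typing import List, Sequence

Matrix = List[List[float]]


def symmetric_kernel(k: Matrix) -> Matrix:
    """Symmetric kernel k(x, y) / (v(x) v(y)), with v^2(x) the row sum of k."""
    v = [math.sqrt(sum(row)) for row in k]
    n = len(k)
    return [[k[x][y] / (v[x] * v[y]) for y in range(n)] for x in range(n)]


def matmul(p: Matrix, q: Matrix) -> Matrix:
    n = len(p)
    return [[sum(p[i][l] * q[l][j] for l in range(n)) for j in range(n)]
            for i in range(n)]


def matrix_power(s: Matrix, m: int) -> Matrix:
    """m-th power of s for m >= 1: the discrete kernel of the m-step operator."""
    result = s
    for _ in range(m - 1):
        result = matmul(result, s)
    return result


def diffusion_distance_sq(s_m: Matrix, x: int, y: int) -> float:
    """Squared diffusion distance from the diagonal and cross terms of s_m."""
    return s_m[x][x] + s_m[y][y] - 2.0 * s_m[x][y]


def bump_distance_sq(s_m: Matrix, x: int, y: int) -> float:
    """Squared L2 distance between rows x and y of s_m (the two bumps)."""
    return sum((s_m[x][z] - s_m[y][z]) ** 2 for z in range(len(s_m)))


# Toy sample (hypothetical) with a Gaussian kernel of scale 4.
points = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (3.0, 3.0)]
k = [[math.exp(-math.dist(p, q) ** 2 / 4.0) for q in points] for p in points]
s = symmetric_kernel(k)
m = 2
lhs = diffusion_distance_sq(matrix_power(s, 2 * m), 0, 3)  # from the 2m-step kernel
rhs = bump_distance_sq(matrix_power(s, m), 0, 3)           # from the m-step bumps
```

The two quantities agree up to rounding because the kernel is symmetric, so the inner product of two rows of the $m$-step kernel is an entry of the $2m$-step kernel.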

The eigenfunctions have the classical interpretation of an orthonormal basis, and their
frequency content can be related to the spectrum of operator $A$ in what constitutes a generalized
Heisenberg principle. The key observation is that, for many practical examples, the numerical
rank of the operator $A^m$ decays rapidly as seen from equation (2.4) or from Figure 1.
More precisely, since $0 \leq \lambda_i \leq \lambda_0 = 1$, the kernel $\tilde{a}^{(m)}(x,y)$, and therefore the distance
$D_m(x,y)$, can be computed to high accuracy with only a few terms in the sum of (2.4), that is
to say, by only retaining the eigenfunctions $\phi_i$ for which $\lambda_i^{2m}$ exceeds a certain precision
threshold. Therefore, the rows (the so-called bumps) of $A^m$ span a space of lower numerical
dimension, and the set of columns can be downsampled. Furthermore, to generate this space,
one just needs the top eigenfunctions, as prescribed in equation (2.4). Consequently, by a
change of basis, eigenfunctions corresponding to eigenvalues at the beginning of the spectrum
have low frequencies, and the number of oscillations increases as one moves further down in
the spectrum.
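
The truncation rule can be illustrated with a short sketch. The spectrum below is hypothetical and the function name `retained_terms` is an assumption made for this example; the point is only that the number of terms $\lambda_i^{2m}$ above a fixed threshold shrinks as the diffusion time $m$ grows.

```python
from typing import Sequence


def retained_terms(lambdas: Sequence[float], m: int, delta: float) -> int:
    """Count the leading indices i for which lambda_i^(2m) exceeds delta.

    `lambdas` must be sorted in non-increasing order with values in [0, 1],
    so the scan can stop at the first term that falls below the threshold.
    """
    count = 0
    for lam in lambdas:
        if lam ** (2 * m) <= delta:
            break
        count += 1
    return count


# Hypothetical spectrum with lambda_0 = 1.
lambdas = [1.0, 0.9, 0.5, 0.1]
short_time = retained_terms(lambdas, m=1, delta=0.02)
medium_time = retained_terms(lambdas, m=4, delta=0.01)
long_time = retained_terms(lambdas, m=32, delta=0.01)
```

For this spectrum three terms survive at $m = 1$, two at $m = 4$ and only the trivial one at $m = 32$, which is the sense in which the numerical rank decays with $m$.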

The link between diffusion maps and distances can be summarized by the spectral identity

$\|\Phi_m(x) - \Phi_m(y)\|^2 = \sum_{j \geq 0} \lambda_j^{2m} (\phi_j(x) - \phi_j(y))^2 = D_m^2(x,y)$,

which means that the diffusion map $\Phi_m$ embeds the data into a Euclidean space in which the
Euclidean distance is equal to the diffusion distance $D_m$. Moreover, the diffusion distance
can be accurately approximated by retaining only the terms for which $\lambda_j^{2m}$ remains numerically
significant: the embedding

$x \mapsto \breve{x} = (\lambda_0^m \phi_0(x), \lambda_1^m \phi_1(x), \ldots, \lambda_{j_0 - 1}^m \phi_{j_0 - 1}(x))$

satisfies

$D_m^2(x,y) = \sum_{j=0}^{j_0 - 1} \lambda_j^{2m} (\phi_j(x) - \phi_j(y))^2 \left(1 + O(e^{-\alpha m})\right) = \|\breve{x} - \breve{y}\|^2 \left(1 + O(e^{-\alpha m})\right)$.

Therefore there exists an $m_0$ such that for all $m \geq m_0$, the diffusion map with the first $j_0$
eigenfunctions embeds the data into $\mathbb{R}^{j_0}$ in an approximately isometric fashion, with respect
to the diffusion distance $D_m$.

2.2. The heat diffusion map on Riemannian submanifolds.
Suppose that the data set $X$ is approximately lying
along a submanifold $\mathcal{M} \subset \mathbb{R}^n$, with a density $p(x)$ (not necessarily
uniform on $\mathcal{M}$). This kind of situation arises in
many applications ranging from hyperspectral imagery to image
processing to vision. For instance, in the latter field, a
model for edges can be generated by considering pixel neighborhoods
whose variability is governed by a few parameters
[11, 12].

We consider isotropic kernels, i.e., kernels of the form
$k_\varepsilon(x,y) = h(\|x - y\|^2/\varepsilon)$.


Figure 2. A dumbbell (a) is embedded using the first 3 eigenfunctions (b).
Because of the bottleneck, the two lobes are pushed away from each other.
Observe also that in the embedding space, point A is closer to the handle
(point B) than to any point on the edge (like point C), as there are many
more short paths joining A and B than A and C.

In [5], Belkin et al. suggest taking $k_\varepsilon(x,y) = e^{-\|x-y\|^2/\varepsilon}$ and
applying the weighted graph Laplacian normalization procedure
described in the previous section. They show that if the
density of points is uniform, then as $\varepsilon \to 0$, one is able to
approximate the Laplace-Beltrami operator $\Delta$ on $\mathcal{M}$.

However, when the density $p$ is not uniform, as is often the
case, the limit operator is conjugate to an elliptic Schrödinger-type
operator having the more general form $\Delta + Q$, where
$Q(x) = \Delta p(x)/p(x)$ is a potential term capturing the influence of the
non-uniform density. By writing the non-uniform density in
a Boltzmann form, $p(x) = e^{-U(x)}$, the infinitesimal operator
can be expressed as

(2.5)  $\Delta\phi + \left(\|\nabla U\|^2 - \Delta U\right)\phi.$

This generator corresponds to the forward diffusion operator
and is the adjoint of the infinitesimal generator of the backward
operator, given by

(2.6)  $\Delta\phi - 2\nabla\phi \cdot \nabla U.$

As is well known from quantum physics, for a double-well potential
$U$, corresponding to two separated clusters, the first
non-trivial eigenfunction of this operator discriminates between
the two wells. This result reinforces the use of the
standard graph Laplacian for computing an approximation
to the normalized cut problem, as described in [13], and more
generally the use of the first few eigenvectors for spectral
clustering, as suggested by Weiss [14].

In order to capture the geometry of a given manifold, regardless
of the density, we propose a different normalization
that asymptotically recovers the eigenfunctions of the
Laplace-Beltrami (heat) operator on the manifold. For any
rotation-invariant kernel $k_\varepsilon(x,y) = h(\|x-y\|^2/\varepsilon)$, we consider
the normalization described in the box below. The operator




Figure  1.  Left:  spectra  of  some  powers  of  A.  Middle  and  right:  consider  a  mixture  of  two  materials  with 
different  heat  conductivity.  The  original  geometry  (middle)  is  mapped  as  a  “butterfly”  set,  in  which  the 
red  (higher  conductivity)  and  blue  phases  are  organized  according  to  the  diffusion  they  generate:  the  cord 
length  between  two  points  in  the  diffusion  space  measures  the  quantity  of  heat  that  can  travel  between  these 
points. 


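$A_\varepsilon$ can be used to define a discrete approximate Laplace operator
as follows:

$$\Delta_\varepsilon = \frac{I - A_\varepsilon}{\varepsilon},$$

and it can be verified that $\Delta_\varepsilon = \Delta_0 + \varepsilon^{1/2} R_\varepsilon$, where $\Delta_0$ is a
multiple of the Laplace-Beltrami operator $\Delta$ on $\mathcal{M}$, and $R_\varepsilon$
is bounded on a fixed space of bandlimited functions. From
this, we can deduce the following result: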

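Theorem 2.1. Let $t > 0$ be a fixed number. Then, as $\varepsilon \to 0$,

$$A_\varepsilon^{t/\varepsilon} = (I - \varepsilon\Delta_\varepsilon)^{t/\varepsilon} = (I - \varepsilon\Delta_0)^{t/\varepsilon} + O(\varepsilon^{1/2}) = e^{-t\Delta_0} + O(\varepsilon^{1/2}),$$

and the kernel of $A_\varepsilon^{t/\varepsilon}$ is given as

$$a_\varepsilon^{(t)}(x,y) = \sum_{j \ge 0} \lambda_{\varepsilon,j}^{t/\varepsilon}\,\phi_j^{(\varepsilon)}(x)\,\phi_j^{(\varepsilon)}(y) = \sum_{j \ge 0} e^{-\nu_j^2 t}\,\phi_j(x)\,\phi_j(y) + O(\varepsilon^{1/2}) = h_t(x,y) + O(\varepsilon^{1/2}),$$

where $\{\nu_j^2\}$ and $\{\phi_j\}$ are the eigenvalues and eigenfunctions
of the limiting Laplace operator, $h_t(x,y)$ is the heat diffusion
kernel at time $t$, and all estimates are relative to any fixed
space of bandlimited functions.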


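Approximation of the Laplace-Beltrami diffusion kernel

1) Let $p_\varepsilon(x) = \int_X k_\varepsilon(x,y)\,p(y)\,dy$,
and form the new kernel $\tilde{k}_\varepsilon(x,y) = \dfrac{k_\varepsilon(x,y)}{p_\varepsilon(x)\,p_\varepsilon(y)}$.

2) Apply the weighted graph Laplacian normalization to this kernel by defining
$v_\varepsilon(x) = \int_X \tilde{k}_\varepsilon(x,y)\,p(y)\,dy$
and by setting $a_\varepsilon(x,y) = \dfrac{\tilde{k}_\varepsilon(x,y)}{v_\varepsilon(x)}$.

Then the operator $A_\varepsilon f(x) = \int_X a_\varepsilon(x,y)\,f(y)\,p(y)\,dy$ is an
approximation of the Laplace-Beltrami diffusion kernel
at time $\varepsilon$.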
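
On a finite sample, the integrals in the box become sums over the data points. The following is a minimal sketch of the two steps under that assumption; the function name `laplace_beltrami_markov`, the Gaussian choice $h(r) = e^{-r}$, the value of `eps`, and the test data (unevenly sampled points on a circle) are illustrative and not from the text.

```python
import numpy as np

def laplace_beltrami_markov(X, eps):
    """Discrete analogue of the box: density-normalize the kernel, then
    apply the weighted graph Laplacian (row) normalization."""
    sq = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
    K = np.exp(-sq / eps)                 # k_eps(x, y) with h(r) = exp(-r)
    p = K.sum(axis=1)                     # p_eps(x): kernel density estimate
    K_tilde = K / np.outer(p, p)          # step 1: k~ = k / (p(x) p(y))
    v = K_tilde.sum(axis=1)               # v_eps(x)
    return K_tilde / v[:, None]           # step 2: a = k~ / v(x)

rng = np.random.default_rng(1)
# Non-uniform angles, so the sampling density varies along the circle.
theta = np.sort(rng.beta(2.0, 5.0, size=200)) * 2.0 * np.pi
X = np.column_stack([np.cos(theta), np.sin(theta)])
A = laplace_beltrami_markov(X, eps=0.05)
```

Each row of `A` sums to one, so `A` can be iterated as a Markov matrix or passed to an eigensolver to obtain an embedding.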


For simplicity, we assume that on the compact manifold $\mathcal{M}$,
the data points are relatively densely sampled (each ball of
radius $\sqrt{\varepsilon}$ contains enough sample points so that integrals can
be approximated by discrete sums). Moreover, if the data only
cover a subdomain of $\mathcal{M}$ with nonempty boundary, then $\Delta_0$
needs to be interpreted as acting with Neumann boundary
conditions. As in the previous section, one can compute heat
diffusion distances and the corresponding embedding. Moreover,
any closed rectifiable curve can be embedded as a circle
on which the density of points is preserved: we have thus separated
the geometry of the set from the distribution of the
points (see Figure 3 for an example).

2.3. Anisotropic diffusion and stochastic differential
equations. So far we have considered the analysis of general
datasets by diffusion maps, without considering the source of
the data. One important case of interest is when the data
$x$ is sampled from a stochastic dynamical system. Consider
therefore data sampled from a system $x(t) \in \mathbb{R}^n$ whose time
evolution is described by the following Langevin equation

(2.7)  $\dot{x} = -\nabla U(x) + \sqrt{2}\,\dot{w}$

where $U$ is the free energy and $w(t)$ is the standard $n$-dimensional
Brownian motion. Let $p(y,t\,|\,x,s)$ denote the transition probability
of finding the system at location $y$ at time $t$, given an
initial location $x$ at time $s$. Then, in terms of the variables
$\{y,t\}$, $p$ satisfies the forward Fokker-Planck equation (FPE),
for $t > s$,

(2.8)  $\dfrac{\partial p}{\partial t} = \nabla \cdot \left(\nabla p + p\,\nabla U(y)\right)$

while in terms of the variables $\{x,s\}$, the transition probability
satisfies the backward equation

(2.9)  $-\dfrac{\partial p}{\partial s} = \Delta p - \nabla p \cdot \nabla U(x)$

Figure  3.  Original  spiral  curve  (a)  and  the  density  of  points  on  it  (b),  embedding  obtained  from  the 
normalized  graph  Laplacian  (c)  and  embedding  from  the  Laplace-Beltrami  approximation  (d). 


As time $t \to \infty$, the solution of the forward FPE converges
to the steady state Boltzmann density

(2.10)  $p(x) = \dfrac{e^{-U(x)}}{Z}$

where the partition function $Z$ is the appropriate normalization
constant.

The  general  solution  to  the  FPE  can  be  written  in  terms 
of  an  eigenfunction  expansion 

oo 

(2.11)  p(x,t)  =  ^  (x) 

3=0 

where $\lambda_j$ are the eigenvalues of the Fokker-Planck operator,
with $\lambda_0 = 1 > \lambda_1 > \lambda_2 > \ldots$, and with $\phi_j(x)$ the corresponding
eigenfunctions. The coefficients $a_j$ depend on the initial
conditions. A similar expansion exists for the backward equation,
with the eigenfunctions of the backward operator given
by $\psi_j(x) = e^{U(x)}\phi_j(x)$.

As can be seen from equation (2.11), the long-time asymptotics
of the solution are governed only by the first few eigenfunctions
of the Fokker-Planck operator. While in low dimensions,
e.g. $n \le 3$, approximations to these eigenfunctions can
be computed via numerical solutions of the partial differential
equation, in general, this is infeasible in high dimensions. On
the other hand, simulations of trajectories according to the
Langevin equation (2.7) are easily performed. An interesting
question, then, is whether it is possible to obtain approximations
to these first few eigenfunctions from (large enough)
data sampled from these trajectories.
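
Simulating (2.7) only requires a time-stepping scheme. The following is a minimal sketch using the Euler-Maruyama discretization, which is a standard choice and not one prescribed by the text; the quadratic potential $U(x) = \|x\|^2/2$, the step size, the number of walkers and the function names are assumptions made for illustration. For this potential the Boltzmann density (2.10) is a standard Gaussian, which gives a simple sanity check on the samples.

```python
import numpy as np

def simulate_langevin(grad_U, x0, dt, n_steps, rng):
    """Euler-Maruyama steps for dx = -grad U(x) dt + sqrt(2) dw.
    x0 has shape (n_walkers, n); all walkers evolve independently."""
    x = np.array(x0, dtype=float)
    for _ in range(n_steps):
        noise = rng.normal(size=x.shape)
        x = x - grad_U(x) * dt + np.sqrt(2.0 * dt) * noise
    return x

def grad_U(x):
    """Gradient of the assumed example potential U(x) = |x|^2 / 2."""
    return x

rng = np.random.default_rng(2)
x_final = simulate_langevin(grad_U, np.zeros((4000, 2)), dt=0.01,
                            n_steps=1500, rng=rng)
sample_var = x_final.var(axis=0)   # close to 1 per coordinate at equilibrium
```

The final positions `x_final` play the role of the data set sampled from the stochastic system.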

In the previous section we saw that the infinitesimal generator
of the normalized graph Laplacian construction corresponds
to a Fokker-Planck operator with a potential $2U(x)$,
see eq. (2.6). Therefore, in general, there is no direct connection
between the eigenvalues and eigenfunctions of the normalized
graph Laplacian and those of the underlying Fokker-Planck
operator (2.8). However, it is possible to construct
a different normalization that yields infinitesimal generators
corresponding to the potential $U(x)$ without the additional
factor of two.


Consider the following anisotropic kernel,

(2.12)  $\tilde{k}_\varepsilon(x,y) = \dfrac{k_\varepsilon(x,y)}{\sqrt{p_\varepsilon(x)\,p_\varepsilon(y)}}$.

A similar analysis to that of the previous section shows that
the normalized graph Laplacian construction that corresponds
to this kernel gives in the asymptotic limit the correct Fokker-Planck
operator, i.e., with the potential $U(x)$.

Since the Euclidean distance in the diffusion map space
corresponds to diffusion distance in the feature space, the
first few eigenvectors corresponding to the anisotropic kernel
(2.12) capture the long-time asymptotic behavior of the stochastic
system (2.7). Therefore, the diffusion map can be seen
as an empirical method for homogenization (see [10] for more
details). These variables are the right observables with which
to implement the equation-free complex/multiscale computations
of Kevrekidis et al. (see [15] and [16]).


2.4. One-parameter family of diffusion maps. In the
previous sections we showed three different constructions of
Markov chains on a discrete data set that asymptotically
recover either the Laplace-Beltrami operator on the manifold,
or the backward Fokker-Planck operator with potential
$2U(x)$ for the normalized graph Laplacian, or $U(x)$ for the
anisotropic diffusion kernel.

In fact, these three normalizations can be seen as specific
cases of a one-parameter family of different diffusion maps,
based on the kernel

(2.13)  $k_\varepsilon^{(\alpha)}(x,y) = \dfrac{k_\varepsilon(x,y)}{p_\varepsilon^{\alpha}(x)\,p_\varepsilon^{\alpha}(y)}$

for some $\alpha \ge 0$.

It can be shown [9] that the forward infinitesimal operator
generated by this diffusion is

(2.14)  $H_f^{(\alpha)}\phi = \Delta\phi - \left(e^{(1-\alpha)U}\,\Delta e^{-(1-\alpha)U}\right)\phi$

One can easily see that the interesting cases are: i) $\alpha = 0$,
corresponding to the classical normalized graph Laplacian,
ii) $\alpha = 1$, yielding the Laplace-Beltrami operator, and iii)
$\alpha = 1/2$, yielding the backward Fokker-Planck operator.
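
The three cases differ only in the exponent applied to the density estimate before row normalization. The following is a minimal discrete sketch, assuming the density $p_\varepsilon$ is estimated by the row sums of the kernel matrix; the function name `alpha_normalized_markov`, the random test points and the kernel scale are illustrative, not from the text.

```python
import numpy as np

def alpha_normalized_markov(K, alpha):
    """Markov matrix from the kernel k / (p(x)^alpha p(y)^alpha), followed by
    row normalization. K is a symmetric kernel matrix with positive entries."""
    p = K.sum(axis=1)                         # discrete estimate of p_eps
    K_alpha = K / np.outer(p, p) ** alpha     # kernel (2.13)
    return K_alpha / K_alpha.sum(axis=1, keepdims=True)

rng = np.random.default_rng(3)
X = rng.normal(size=(50, 3))
sq = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
K = np.exp(-sq / 2.0)

A_graph = alpha_normalized_markov(K, 0.0)     # classical normalized graph Laplacian
A_fp = alpha_normalized_markov(K, 0.5)        # Fokker-Planck normalization
A_lb = alpha_normalized_markov(K, 1.0)        # Laplace-Beltrami normalization
```

With `alpha = 0.0` the density factor is identically one, so the result reduces to plain row normalization of `K`.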
Figure 4. Left: the original function $f$ on
the unit square. Right: the first non-trivial
eigenfunction. On this plot, the colors correspond
to the values of $f$.

Therefore, while the graph Laplacian based on a kernel
with $\alpha = 1$ captures the geometry of the data, with the density
$e^{-U}$ playing absolutely no role, the other normalizations
also take into account the density of the points on the manifold.

3.  Directed  diffusion  and  learning  by  diffusion 

It follows from the previous section that the embedding
that one obtains depends heavily on the choice of a diffusion
kernel. In some cases, one is interested in constructing diffusion
kernels which are data or task driven. As an example,
consider an empirical function $F(x)$ on the data. We would
like to find a coordinate system in which the first coordinate
has the same level lines as the empirical function $F$. For that
purpose, we replace the Euclidean distance in the Gaussian
kernel by the anisotropic distance

$$D_\varepsilon^2(x,y) = d^2(x,y)/\varepsilon + |F(x) - F(y)|^2/\varepsilon^2.$$

The corresponding limit of $A_\varepsilon^{t/\varepsilon}$ is a diffusion along the level
surfaces of $F$, from which it follows that the first nonconstant
eigenfunction of $A_\varepsilon$ has to be constant on level surfaces. This
is illustrated in Figure 4, where the graph represents the function
$F$ and the colors correspond to the values of the first
non-trivial eigenfunction. In particular, observe that the level
lines of this eigenfunction are the integral curves of the field
orthogonal to the gradient of $F$. This is clear since we forced
the diffusion to follow this field at a much faster rate, in effect
integrating that field. It also follows that any differential
equation can be integrated numerically by a non-isotropic diffusion
in which the direction of propagation is faster along the
field specified by the equation.
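
The effect of the anisotropic distance on the kernel weights can be seen on a toy example. The following is a minimal sketch, assuming $d$ is the Euclidean distance and the kernel is $\exp(-D_\varepsilon^2)$; the function name `directed_kernel`, the three sample points, and the choice $F(x_1, x_2) = x_2$ are hypothetical.

```python
import numpy as np

def directed_kernel(X, F, eps):
    """Gaussian kernel exp(-D_eps^2) with
    D_eps^2(x, y) = |x - y|^2 / eps + |F(x) - F(y)|^2 / eps^2."""
    d2 = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
    dF2 = (F[:, None] - F[None, :]) ** 2
    return np.exp(-(d2 / eps + dF2 / eps ** 2))

# Points 1 and 2 are equally far from point 0, but only point 1 shares its F value.
X = np.array([[0.0, 0.0], [0.1, 0.0], [0.0, 0.1]])
F = X[:, 1]                     # F(x1, x2) = x2, so level lines are horizontal
K = directed_kernel(X, F, eps=0.1)
```

Here `K[0, 1]` is larger than `K[0, 2]`: the walk moves more easily along a level line of `F` than across level lines, which is the mechanism described above.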

We now apply this approach to the construction of empirical
models for statistical learning. Assume that a data set
has been generated by a process whose local statistical properties
vary from location to location. Around each point $x$,
we view all neighboring data points as having been generated
by a local diffusion whose probability density is estimated by
$p_x(y) = c_x \exp(-q_x(x - y))$, where $q_x$ is a quadratic form obtained
empirically by PCA from the data in a small neighborhood
of $x$. We then use the kernel $a(x,z) = \int p_x(y)\,p_z(y)\,dy$
to model the diffusion. Note that the distance defined by
this kernel is $\left(\int |p_x(y) - p_z(y)|^2\,dy\right)^{1/2}$, which can be viewed
as the natural distance on the “statistical tangent space” at
every point in the data. If labels are available, the information
about the labels can be incorporated by, for example,
locally warping the metric so that the diffusion starting in one
class stays in the class without leaking to other classes. This
could be obtained by using local discriminant analysis (e.g.
linear, quadratic or Fisher discriminant analysis) to build a
local metric whose fast directions are parallel to the boundary
between classes and whose slow directions are transversal to
the classes (see e.g. [1]).

In data classification, geometric diffusion provides a powerful
tool to identify arbitrarily shaped clusters with partially
labelled data. Suppose, for example, we are given a data set
$X$ with $N$ points from $C$ different classes. Assume our task
is to learn a function $L : X \to \{1, \ldots, C\}$ for every point in
$X$, but we are given the labels of only $s \ll N$ points in $X$.
If we cannot infer the geometry of the data from the labelled
points only, many parametric methods (e.g. Gaussian classifiers)
and non-parametric techniques (e.g. nearest neighbors)
lead to poor results. In Figure 5, we illustrate this with an example.
Here we have a hyperspectral image of pathology tissue.
Each pixel $(x, y)$ in the image is associated with a vector
$\{I(x,y)\}_\lambda$ that reflects the material’s spectral characteristics
at different wavelengths $\lambda$. We are given a partially labelled
set for three different tissue classes (marked with blue, green,
and pink in Figure 5a) and are asked to classify all pixels in the
image using only spectral, as opposed to spatial, information.
Both Gaussian classifiers and nearest-neighbor classifiers (see
Figure 5b) perform poorly in this case as there is a gradual change
in both shading and chemical composition in the vertical direction
of the tissue sample.

The diffusion framework, however, provides an alternative
classification scheme that links points together by a Markov
random walk (see also [17] for a discussion): let $\chi_i$ be the
$L^1$-normalized characteristic function of the initially labelled
set from class $i$. At a given time $t$, we can interpret the
diffused label functions $(A^t\chi_i)_i$ as the posterior probabilities
of the points belonging to class $i$. Choose a time $\tau$ when the
margin between the classes is maximized, and then define the
label of a point $x \in X$ as the maximum a posteriori estimate
$L(x) = \arg\max_i A^\tau\chi_i(x)$. Figure 5c shows the classification
of the pathology sample using the above scheme. The latter
result agrees significantly better with a specialist’s view of
correct tissue classification.
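
The labelling rule can be sketched in a few lines. This is an illustration only: the function name `diffuse_labels`, the two-cluster toy data, the kernel scale and the fixed value of `tau` are assumptions, and the margin-based choice of the diffusion time described above is not implemented. Each class is assumed to have at least one labelled point.

```python
import numpy as np

def diffuse_labels(A, labelled_idx, labelled_cls, n_classes, tau):
    """Diffuse L1-normalized class indicators for tau steps and take the argmax."""
    n = A.shape[0]
    chi = np.zeros((n, n_classes))
    chi[labelled_idx, labelled_cls] = 1.0
    chi /= chi.sum(axis=0, keepdims=True)          # L1 normalization per class
    posterior = np.linalg.matrix_power(A, tau) @ chi
    return posterior.argmax(axis=1), posterior

# Two well-separated clusters with a single labelled point in each.
rng = np.random.default_rng(4)
X = np.vstack([rng.normal(0.0, 0.3, size=(30, 2)),
               rng.normal(5.0, 0.3, size=(30, 2))])
sq = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
K = np.exp(-sq / 0.5)
A = K / K.sum(axis=1, keepdims=True)               # row-stochastic Markov matrix
labels, posterior = diffuse_labels(A, np.array([0, 30]), np.array([0, 1]),
                                   n_classes=2, tau=4)
```

Because almost no probability mass crosses the gap between the clusters, each point receives the label of the single labelled point in its own cluster.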

In many practical situations, the user may want to refine
the classification of points that occur near the boundaries
between classes in state space. One option is to use an iterative
scheme, where the user provides new labelled data where
needed and then restarts the diffusion with the new enlarged
Figure 5. a: Pathology slice with partially labelled data; the 3 tissue classes are marked with blue, green
and pink. b: Tissue classification from spectra using 1-nearest neighbors. c: Tissue classification from
spectra using geometric diffusion.


training set. However, if the total data set $X$ is very large, an
alternative, more efficient, scheme is to define a modified kernel
that incorporates both previous classification results and
new information provided by the user: for example, assign to
each point a score $s_i(x) \in [0,1]$ that reflects the probability
that a point $x$ belongs to class $i$. Then use these scores to warp
the diffusion so that we have a set of class-specific diffusion
kernels $\{A_i\}_i$ that slow down diffusion between points with
different label probabilities. Choose, for example, in each new
iteration, weights according to $k_i(x,y) = k(x,y)\,s_i(x)\,s_i(y)$,
where $s_i = A^\tau\chi_i$ are the label posteriors from the previous
diffusion, and renormalize the kernel to be a Markov matrix.
If the user provides a series of consistent labelled examples,
the classification will speed up in each new iteration and the
diffusion will eventually occur only within disjoint sets of samples
with the same labels.
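
The reweighting step can be written compactly. The following is a minimal sketch; the function name `class_specific_markov`, the toy kernel and scores, and the small `floor` that guards against empty rows are assumptions added for illustration and are not part of the scheme as described.

```python
import numpy as np

def class_specific_markov(K, s_i, floor=1e-12):
    """Markov matrix for the warped kernel k_i(x, y) = k(x, y) s_i(x) s_i(y).
    The small floor (an added safeguard) avoids empty rows when s_i vanishes."""
    s = np.maximum(np.asarray(s_i, dtype=float), floor)
    K_i = K * np.outer(s, s)
    return K_i / K_i.sum(axis=1, keepdims=True)

K = np.array([[1.0, 0.5, 0.5],
              [0.5, 1.0, 0.5],
              [0.5, 0.5, 1.0]])
s_i = np.array([0.9, 0.8, 0.1])      # hypothetical scores for class i
A_i = class_specific_markov(K, s_i)
A_plain = K / K.sum(axis=1, keepdims=True)
```

In this toy case the transition probability from the first point to the third, whose score for class $i$ is low, drops well below its value under the unweighted kernel.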


4.  Summary 

In this paper, we presented a general framework for structural
multiscale geometric organization of graphs and subsets
of $\mathbb{R}^n$. We introduced a family of diffusion maps that allow the
exploration of the geometry, the statistics, and functions
of the data. Diffusion maps provide a natural low-dimensional
embedding of high-dimensional data that is suited for subsequent
tasks such as visualization, clustering, and regression.
In part II of this paper, we introduce multiscale methods that
allow fast computation of functions of diffusion operators on
the data. We also present a scheme for extending empirical
functions.
GEOMETRIC DIFFUSIONS AS A TOOL FOR
HARMONIC ANALYSIS AND STRUCTURE
DEFINITION OF DATA

PART II: MULTISCALE METHODS


R. R. Coifman^1, S. Lafon^1, A. B. Lee^1, M. Maggioni^1,
B. Nadler^1, F. J. Warner^1, S. W. Zucker^2


Abstract. In the companion paper a framework for structural
multiscale geometric organization of subsets of $\mathbb{R}^n$ and
of graphs was introduced. Here diffusion semigroups are used
to generate multiscale analyses in order to organize and represent
complex structures. We emphasize the multiscale nature
of these problems, and we build scaling functions of Markov
matrices (describing local transitions) that lead to macroscopic
descriptions at different scales. The process of iterating
or diffusing the Markov matrix is seen as a generalization of
some aspects of the Newtonian paradigm, in which local infinitesimal
transitions of a system lead to global macroscopic
descriptions by integration. This part deals with the construction
of fast order $N$ algorithms for data representation
and for homogenization of heterogeneous structures.

1.  Introduction 

In the companion paper [1] it is shown that the eigenfunctions
of a diffusion operator $A$ can be used to perform global
analysis of the set and of functions on a set. Here we present
a construction of a multiresolution analysis of functions on
the set related to the diffusion operator $A$. This allows one to
perform a local analysis at different diffusion scales.

This is motivated by the fact that in many situations one is
interested not in the data itself, but in functions on the data,
and in general these functions exhibit different behaviour at
different scales. This is the case in many problems in learning,
in analysis on graphs, in dynamical systems, etc. The
analysis through the eigenfunctions of the Laplacian considered in
[1] is global and is affected by global characteristics of the
space. It can be thought of as global Fourier analysis. The
multiscale analysis proposed here is in the spirit of wavelet
analysis.

We refer the reader to [2, 3, 4] for further details and applications
of this construction, as well as a discussion of the
many relationships between this work and the work of many
other researchers in several branches of mathematics and applied
mathematics. Here we would like to at least mention
the relationship with Fast Multipole Methods [5, 6], Algebraic
Multigrid [7], and lifting [8, 9].

2.  Multiscale  Analysis  of  Diffusion 

2.1. Construction of the Multiresolution Analysis. Suppose
we are given a self-adjoint diffusion operator $A$ as in [1]
acting on $L^2$ of a metric measure space $(X, d, \mu)$. We interpret
$A$ as a dilation operator, and use it to define a multiresolution
analysis. It is natural to discretize the semigroup $\{A^t\}_{t \ge 0}$ of
the powers of $A$ at a logarithmic scale, for example at the
times

(2.1)  $t_j = 1 + 2 + 2^2 + \cdots + 2^j = 2^{j+1} - 1$

For a fixed $\varepsilon \in (0,1)$, we define the approximation spaces by

(2.2)  $V_j = \langle \{\phi_i : \lambda_i^{t_j} \ge \varepsilon\} \rangle$

where the $\phi_i$'s are the eigenvectors of $A$, ordered by decreasing
eigenvalue. We will denote by $P_j$ the orthogonal projection
onto $V_j$. The set of subspaces $\{V_j\}_{j \in \mathbb{Z}}$ is a multiresolution
analysis in the sense that it satisfies the following properties:

(i) $\lim_{j \to -\infty} V_j = L^2(X, \mu)$,
$\lim_{j \to +\infty} V_j = \langle \{\phi_i : \lambda_i = 1\} \rangle$.

(ii) $V_{j+1} \subseteq V_j$ for every $j \in \mathbb{Z}$.

(iii) $\{\phi_i : \lambda_i^{t_j} \ge \varepsilon\}$ is an orthonormal basis for $V_j$.

We can also define the detail subspaces $W_j$ as the orthogonal
complement of $V_{j+1}$ in $V_j$, so that we have the familiar
relation between approximation and detail subspaces as in the
classical wavelet multiresolution constructions:

$$V_j = V_{j+1} \oplus^{\perp} W_j.$$
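
The dimensions of the approximation spaces follow directly from definition (2.2) once the spectrum is known. The following is a minimal sketch, using a hypothetical geometrically decaying spectrum $\lambda_i = 0.9^i$ and a hypothetical threshold; the function name is also illustrative.

```python
import numpy as np

def approximation_space_dims(eigenvalues, eps, n_scales):
    """dim V_j = #{i : lambda_i^{t_j} >= eps} with t_j = 2^(j+1) - 1."""
    lam = np.asarray(eigenvalues, dtype=float)
    dims = []
    for j in range(n_scales):
        t_j = 2 ** (j + 1) - 1
        dims.append(int(np.sum(lam ** t_j >= eps)))
    return dims

lam = 0.9 ** np.arange(50)          # hypothetical spectrum with geometric decay
dims_V = approximation_space_dims(lam, eps=1e-3, n_scales=6)
# dims_V == [50, 22, 10, 5, 3, 2]
```

The dimensions shrink roughly by half at each scale for this spectrum, which is the nesting $V_{j+1} \subseteq V_j$ of property (ii) seen through eigenvalue counts.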

This is very much in the spirit of a Littlewood-Paley decomposition
induced by the diffusion semigroup [10]. However,
although each subspace $V_j$ and $W_j$ has an orthonormal
basis of eigenfunctions, we would like to replace these with
localized orthonormal bases of scaling functions as in wavelet
theory. Generalized Heisenberg principles (see also section 4)
put a lower bound on how much localization can be achieved
at each scale $j$, depending on the spectrum of the operator
$A$ and on the space on which it acts. We would like to have
basis elements that are as localized as the Heisenberg
principle allows at each scale, and that span (approximately) $V_j$. We
do all this while avoiding computation of the eigenfunctions.

We start by fixing a precision $\varepsilon > 0$, and assume that
$A$ is represented on the basis $\Phi_0 = \{\delta_k\}_{k \in X}$. We consider
the columns of $A$, which can be interpreted as the set of
functions $\tilde\Phi_1 = \{A\delta_k\}_{k \in X}$ on $X$. We use a local multiscale
Gram-Schmidt procedure, described below, to carefully but
efficiently orthonormalize these columns into a basis
$\Phi_1 = \{\varphi_{1,k}\}_{k \in X_1}$ ($X_1$ is defined as this index set) for the range of
$A$ up to precision $\varepsilon$. This is a linear transformation we represent
by a matrix $G_0$. This yields a subspace that is $\varepsilon$-close
to $V_1$. Essentially $\Phi_1$ is a basis for a subspace which is $\varepsilon$-close
to the range of $A$, with basis elements that are well-localized
and orthogonal. Obviously $|X_1| \le |X|$, but the inequality may
already be strict since part of the range of $A$ may be below
the precision $\varepsilon$. Whether this is the case or not, we then have
a map $M_0$ from $X$ to $X_1$, which is the composition of $A$ with
the orthonormalization by $G_0$. We can also represent $A$ in
the basis $\Phi_1$: we denote this matrix by $A_1$ and compute $A_1^2$.
See the diagram in Figure 1.

We now proceed by looking at the columns of $A_1^2$, which
are $\tilde\Phi_2 = \{A_1^2\delta_k\}_{k \in X_1} = \{A^2\varphi_{1,k}\}_{k \in X_1}$ up to precision $\varepsilon$, by
unravelling the bases on which the various elements are represented.
Again we can apply a local Gram-Schmidt procedure
to orthonormalize this set: this yields a matrix $G_1$ and an
orthonormal basis $\Phi_2 = \{\varphi_{2,k}\}_{k \in X_2}$ for the range of $A_1^2$ up
to precision $\varepsilon$, and hence for the range of $A_0^3$ (writing $A_0 = A$) up to precision
$2\varepsilon$. Moreover, depending on the decay of the spectrum of $A$,
$|X_2| \ll |X_1|$. The matrix $M_1$, which is the composition of
$G_1$ with $A_1^2$, is then of size $|X_2| \times |X_1|$, and $A_2 = M_1 M_1^T$ is a
representation of $A^4$ acting on $\Phi_2$.

After $j$ steps in this fashion, we will have a representation
of $A^{1+2+2^2+\cdots+2^j} = A^{2^{j+1}-1}$ on a basis $\Phi_j = \{\varphi_{j,k}\}_{k \in X_j}$
that spans a subspace which is $j\varepsilon$-close to $V_j$. Depending
on the decay of the spectrum of $A$, we expect $|X_j| \ll |X|$;
in fact, in the ideal situation$^3$ the spectrum of $A$ decays fast
enough so that there exists $\gamma < 1$ such that $|X_j| < \gamma^{2^{j+1}-1}|X|$.
This subspace is spanned by “bump” functions at scale $j$, as
defined by the corresponding power of the diffusion operator
$A$. The “centers” of these bump functions can be identified
with $X_j$, and we can think of $X_j$ as a coarser version of $X$.
The basis $\Phi_j$ is naturally identified with the set of Dirac $\delta$-functions
on $X_j$; however, we can extend these functions, defined
on the “compressed” graph $X_j$, to the whole initial graph $X$
by writing

$$\varphi_{j,k}(x) = M_{j-1}\,\varphi_{j-1,k}(x), \quad x \in X_{j-1}$$
$$\phantom{\varphi_{j,k}(x)} = M_{j-1} M_{j-2} \cdots M_0\,\varphi_{0,k}(x), \quad x \in X_0.$$

Since every function in $\Phi_0$ is defined on $X$, so is every function
in $\Phi_j$. Hence any function on the compressed space $X_j$ can
be extended naturally to the whole $X$. In particular, one
can compute low-frequency eigenfunctions on $X_j$, and then
extend them to the whole $X$. This is of course completely
analogous to the standard construction of scaling functions in
the Euclidean setting [11, 5, 12]. Observe that each point in $X_j$
can be considered as a “local aggregation” of points in $X_{j-1}$,
which is completely dictated by the action of the operator $A$
on functions on $X$: $A$ itself is dictating the geometry with
respect to which it should be analyzed, compressed, or applied
to any vector.

$^3$ By Weyl’s Theorem on the distribution function of the spectrum
of the Laplace-Beltrami operator, this is the case when $A$ is an accurate
enough discretization of the Laplace-Beltrami operator on a smooth compact
Riemannian manifold with smooth boundary.


Figure 1. Diagram for downsampling, orthogonalization
and operator compression.

We have thus computed and efficiently represented the
powers $A^{2^j}$, for $j \ge 0$, which describe the behaviour of the diffusion
at different time scales. This applies to the solution of
discretized partial differential equations, of Markov chains,
and in learning and related classification problems.
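
The orthonormalize-then-square loop can be prototyped with dense matrices. The sketch below is only an illustration of the idea: it swaps in a global Gram-Schmidt procedure with column pivoting for the local multiscale procedure of the text, it stores one compressed matrix per dyadic power $A^{2^j}$, and it uses no sparsity, so it does not achieve the fast running times discussed in this part. The function names, the test operator (a symmetrically normalized Gaussian kernel on 60 points of an interval) and the precision are assumptions.

```python
import numpy as np

def pivoted_gram_schmidt(T, eps):
    """Orthonormal basis for the range of T up to precision eps, using
    Gram-Schmidt with column pivoting (global, not the local multiscale variant)."""
    R = np.array(T, dtype=float)              # residual columns
    basis = []
    for _ in range(min(R.shape)):
        norms = np.linalg.norm(R, axis=0)
        k = int(np.argmax(norms))
        if norms[k] <= eps:
            break
        q = R[:, k] / norms[k]
        for b in basis:                       # re-orthogonalize for stability
            q = q - (b @ q) * b
        q = q / np.linalg.norm(q)
        basis.append(q)
        R = R - np.outer(q, q @ R)            # remove the new direction
    return np.column_stack(basis)

def compress_dyadic_powers(A, eps, n_levels):
    """Return [(Phi, T)] where, at level j = 1, 2, ..., T represents A^(2^j)
    on the orthonormal basis given by the columns of Phi (written on X)."""
    T = np.array(A, dtype=float)
    Phi = np.eye(A.shape[0])
    levels = []
    for _ in range(n_levels):
        Q = pivoted_gram_schmidt(T, eps)      # plays the role of G_j
        M = Q.T @ T                           # plays the role of M_j
        T = M @ M.T                           # T is symmetric, so this is Q^T T^2 Q
        Phi = Phi @ Q
        levels.append((Phi, T))
    return levels

x = np.linspace(0.0, 1.0, 60)
K = np.exp(-(x[:, None] - x[None, :]) ** 2 / 0.05)
d = K.sum(axis=1)
A = K / np.sqrt(np.outer(d, d))               # symmetric diffusion operator
levels = compress_dyadic_powers(A, eps=1e-8, n_levels=4)
sizes = [T.shape[0] for _, T in levels]
```

The list `sizes` gives the number of retained basis functions per level; for an operator with fast spectral decay it decreases quickly, which is what makes the compressed powers cheap to store and apply.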

2.2. Wavelet transforms and Green’s function. The construction
immediately suggests an associated fast scaling function
transform: suppose we are given $f$ on $X$ and want to
compute $\langle f, \varphi_{j,k} \rangle$ for all scales $j$ and corresponding “translations”
$k$. Being given $f$ is equivalent to saying we are given
$(\langle f, \varphi_{0,k} \rangle)_{k \in X}$. Then we can compute
$(\langle f, \varphi_{1,k} \rangle)_{k \in X_1} = M_0 (\langle f, \varphi_{0,k} \rangle)_{k \in X}$, and so on for all scales. The matrices
$M_j$ are sparse (since $A_j$ and $G_j$ are), so this computation is
fast. This generalizes the classical scaling function transform.
We will see later that wavelets can be constructed as well and
a fast wavelet transform is possible.

In the same way, any power of $A$ can be applied fast to a
function $f$. In particular, the Green’s function $(I - A)^{-1}$ can
be applied fast to any function: since

$$(I - A)^{-1} f = \sum_{k=0}^{+\infty} A^k f,$$

if we let $S_K = \sum_{k=0}^{2^K - 1} A^k f$, we see that

$$S_{K+1} = S_K + A^{2^K} S_K = \prod_{k=0}^{K} \left(I + A^{2^k}\right) f,$$

and each term of the product can be applied fast to $f$.

The construction of the multiscale bases can be done in
time $O(n \log^2 n)$, where $n = |X|$, if the spectrum of $A$ has
fast enough decay. The decomposition of a function $f$ onto
the scaling functions and wavelets we construct can be done
in the same time, as can the computation of $(I - A)^{-1} f$.
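
The dyadic recursion for the Green's function is easy to check on a small dense example. The following is a minimal sketch, with the partial sums started at $k = 0$ (the Neumann series) and with an operator scaled to have spectral radius $1/2$ so that the series converges; the test matrix, the function name and the number of doublings `K` are assumptions. Dense squaring is used here only for illustration and is not fast; in the construction above the dyadic powers would be applied in compressed form.

```python
import numpy as np

def apply_green_dyadic(A, f, K):
    """Approximate (I - A)^{-1} f by S_K = sum_{k < 2^K} A^k f, built with
    the recursion S_{k+1} = S_k + A^(2^k) S_k."""
    S = np.array(f, dtype=float)
    P = np.array(A, dtype=float)      # holds A^(2^k)
    for _ in range(K):
        S = S + P @ S
        P = P @ P                     # square to reach the next dyadic power
    return S

rng = np.random.default_rng(5)
W = rng.random((40, 40))
A = 0.5 * W / W.sum(axis=1, keepdims=True)    # spectral radius 0.5 (assumed example)
f = rng.normal(size=40)
approx = apply_green_dyadic(A, f, K=6)        # 64 terms of the series
exact = np.linalg.solve(np.eye(40) - A, f)
```

Six doublings already account for 64 terms of the series, so `approx` and `exact` agree to near machine precision in this example.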

2.3. The orthogonalization process. We sketch here how
the orthogonalization works; for details refer to [3, 2]. Suppose
we start from a $\delta$-local basis $\tilde\Phi = \{\tilde\varphi_x\}_{x \in T}$ (in our case,
$\tilde\varphi_x$ is going to be a bump $A^l\delta_x$). We greedily build a first
layer of basis functions $\Phi_0 = \{\varphi_{0,x_k}\}_{x_k \in \mathcal{K}_0}$, $\mathcal{K}_0 \subseteq T$, as follows.
We let $\varphi_{0,x_0}$ be a basis function with greatest $L^2$-norm.
Then we let $\varphi_{0,x_1}$ be a basis function with biggest $L^2$-norm
among the basis functions with support disjoint from the support
of $\varphi_{0,x_0}$ but not farther than $\delta$ from it. By induction,
after $\varphi_{0,x_0}, \ldots, \varphi_{0,x_l}$ have been chosen, we let $\varphi_{0,x_{l+1}}$ be a
scaling function with largest $L^2$-norm among those having a
support which does not intersect any of the supports of the
basis functions already constructed, but is not farther than $\delta$
from the closest such support. We stop when no such choice
can be made. One can think of $\mathcal{K}_0$ roughly as a $2\delta$ lattice.

At this point $\Phi_0$ in general spans a subspace much smaller than the one spanned by $\tilde\Phi$. We construct a second layer $\Phi_1 = \{\varphi_{1,x_k}\}_{x_k\in\mathcal{K}_1}$, $\mathcal{K}_1 \subseteq \Gamma \setminus \mathcal{K}_0$, as follows. Orthogonalize each $\{\tilde\varphi_x\}_{x\in\Gamma\setminus\mathcal{K}_0}$ to the functions $\{\varphi_{0,x_k}\}_{x_k\in\mathcal{K}_0}$. Observe that since the support of $\tilde\varphi_x$ is small, this orthogonalization is local, in the sense that each $\tilde\varphi_x$ needs to be orthogonalized only to the few $\varphi_{0,x_k}$'s that have an intersecting support. In this way we get a set $\tilde\Phi_1$, orthogonal to $\Phi_0$ but not orthogonal itself. We orthonormalize it exactly as we did to get $\Phi_0$ from $\tilde\Phi$. We proceed by building as many layers as necessary to span the whole space $\langle\tilde\Phi\rangle$ (up to the specified precision $\epsilon$).
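As a rough illustration of orthogonalizing up to a precision $\epsilon$, the sketch below runs a modified Gram-Schmidt procedure that always picks the remaining function of largest norm. It is a simplification, not the procedure of [3, 2]: it ignores the geometric constraint on supports (the $\delta$-separation) and does not exploit locality, and the name `pivoted_gram_schmidt` and the test bumps on a 64-point circle are assumptions made for the example.

```python
import numpy as np

def pivoted_gram_schmidt(Phi, eps=1e-8):
    """Orthonormalize the columns of Phi, always picking the remaining
    column of largest norm and stopping once all norms fall below eps."""
    R = Phi.astype(float).copy()          # residual columns, updated in place
    basis, picked = [], []
    for _ in range(Phi.shape[1]):
        norms = np.linalg.norm(R, axis=0)
        k = int(np.argmax(norms))
        if norms[k] <= eps:
            break
        q = R[:, k] / norms[k]
        basis.append(q)
        picked.append(k)
        R -= np.outer(q, q @ R)           # remove the q-component from every column
    Q = np.column_stack(basis) if basis else np.zeros((Phi.shape[0], 0))
    return Q, picked

# Hypothetical input: bumps A^4 delta_x for a lazy random walk on a 64-point circle.
n = 64
idx = np.arange(n)
A = np.zeros((n, n))
A[idx, idx] = 0.5
A[idx, (idx + 1) % n] = 0.25
A[idx, (idx - 1) % n] = 0.25
Phi = np.linalg.matrix_power(A, 4)        # column x is the bump A^4 delta_x
Q, picked = pivoted_gram_schmidt(Phi, eps=1e-6)
```

The columns of `Q` are orthonormal and span the columns of `Phi` up to the precision `eps`; the list `picked` records the order in which the bumps were selected.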

2.4. Wavelets. We would like to construct bases $\{\psi_{j,k}\}_k$ for the spaces $W_j$, $j > 1$, such that $V_{j+1} \oplus^{\perp} W_j = V_j$. To achieve this, after having built $\{\varphi_{j,k}\}_{k\in\mathcal{K}_j}$ and $\{\varphi_{j+1,k}\}_{k\in\mathcal{K}_{j+1}}$, we can apply our modified Gram-Schmidt procedure with geometric pivoting to the set of functions

$$\{(P_j - P_{j+1})\,\varphi_{j,k}\}_{k\in\mathcal{K}_j},$$

which will yield an orthonormal basis of wavelets for the orthogonal complement of $V_{j+1}$ in $V_j$. Observe that each wavelet is a result of an orthogonalization process which is local, so the computation is again fast. To achieve numerical stability we orthogonalize at each step the remaining $\varphi_{j+1,k}$'s to both the wavelets built so far and $\varphi_{j,k}$. Wavelet subspaces can be recursively split further to obtain diffusion wavelet packets [4], which allow the application of the classical fast algorithms [13] for denoising [14], compression [15] and discrimination [16].

3.  Examples  and  applications 

Example 3.1 (Multiresolution diffusion on the homogeneous circle). To compare with classical constructions of wavelets, we consider the unit circle, sampled at 256 points, and the classical isotropic heat diffusion on it. The initial orthonormal basis $\Phi_0$ is given by the set of $\delta$-functions at each point, and we build the diffusion wavelets at all scales, which clearly relate to splines and multiwavelets. The spectrum of the diffusion operator does not decay very fast. See Figures 2 and 3.

FIGURE 2. Diffusion Multiresolution Analysis on the circle. We consider 256 points on the unit circle, starting with $\varphi_{0,k} = \delta_k$ and with the standard diffusion. We plot several scaling functions in each approximation space $V_j$.


FIGURE 3. Diffusion Multiresolution Analysis on the circle: we plot the compressed matrices representing powers of the diffusion operator; in white are the entries above working precision (here set to $10^{-8}$). Notice the shrinking of the size of the matrices which are being compressed at the different scales.


Example  3.2  (Dumbbell).  We  consider  a  dumbbell-shaped 
manifold,  sampled  at  1400  points,  and  the  diffusion  associated 
to  the  (discretized)  Laplace-Beltrami  operator  as  discussed  in 
[1].  See  Figure  4  for  the  plots  of  some  scaling  functions  and 
wavelets:  they  exhibit  the  expected  locality  and  multiscale 
features,  dependent  on  the  intrinsic  geometry  of  the  mani¬ 
fold. 

Example 3.3 (Multiresolution diffusion on a nonhomogeneous circle). We can apply the construction of diffusion wavelets to non-isotropic diffusions arising from partial differential equations, to tackle problems of homogenization in a natural way. The literature on homogenization is vast, see e.g. [17, 18, 19, 20, 21] and references therein.

Our definition of scales is driven by the differential operator, and in general results in highly nonuniform and nonhomogeneous spatial and spectral scales, and in corresponding coarse equations of the system, which have high precision.

FIGURE 4. Some diffusion scaling functions and wavelets at different scales on a dumbbell-shaped manifold sampled at 1400 points.

FIGURE 5. Multiresolution Diffusion on a circular medium with non-constant diffusion coefficient. Top: several scaling functions and wavelets in different approximation subspaces $V_j$: notice that scaling functions at the same diffusion scale exhibit different spatial localization, which depends on the local diffusion coefficient. Bottom: matrix compression of the dyadic powers of $T$ on the scaling function bases of the $V_j$'s: notice the size of the matrices shrinking with scale.


For example we can consider the non-homogeneous heat equation on the circle

$$\frac{\partial u}{\partial t} = \frac{\partial}{\partial x}\left(c(x)\,\frac{\partial u}{\partial x}\right), \tag{3.1}$$

where $c(x)$ is a positive function close to 0 at certain points and almost 1 at others. We want to represent the intermediate and large scale/time behavior of the solution by compressing powers of the operator representing the discretization of the spatial differential operator $\frac{\partial}{\partial x}\left(c(x)\,\frac{\partial}{\partial x}\right)$. Once discretized, the spatial differential operator on the right-hand side of (3.1) is a matrix $T$ which, when properly normalized, can be interpreted as a non-translation-invariant random walk. Our construction yields a multiresolution associated to this operator that is highly nonuniform, with most scaling functions concentrated around the points where the conductivity is highest, for several scales. The compressed matrices representing the (dyadic) powers of this operator can be viewed as multiscale homogenized versions, at a certain scale which is time and space dependent, of the original operator; see Figure 5.
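One possible discretization, given here only as a sketch, is a centered finite-difference scheme on a periodic grid followed by an explicit Euler step $T = I + \Delta t\, L$. The time step, the sample conductivity $c(x)$ and the name `heat_walk_matrix` are assumptions made for the example; the text does not specify which discretization or normalization was used.

```python
import numpy as np

def heat_walk_matrix(c_half, h):
    """Explicit Euler step T = I + dt * L for L u = d/dx(c(x) du/dx) on a
    periodic grid; c_half[i] is the conductivity between nodes i and i+1."""
    n = len(c_half)
    dt = h**2 / (4.0 * c_half.max())      # keeps the spectrum of T in [0, 1]
    right = dt * c_half / h**2            # probability of the step i -> i+1
    left = np.roll(right, 1)              # probability of the step i -> i-1
    T = np.diag(1.0 - right - left)       # probability of staying at i
    idx = np.arange(n)
    T[idx, (idx + 1) % n] += right
    T[idx, (idx - 1) % n] += left
    return T

n = 256
h = 1.0 / n
x_half = (np.arange(n) + 0.5) * h
c_half = 0.505 + 0.495 * np.cos(2 * np.pi * x_half)   # close to 0 near x = 1/2, close to 1 near x = 0
T = heat_walk_matrix(c_half, h)
eig = np.linalg.eigvalsh(T)
```

With this choice each row of `T` is a probability distribution whose jump probabilities depend on the local conductivity, so the walk is not translation invariant, and `T` is symmetric with spectrum in $[0, 1]$.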

While  the  examples  above  illustrate  classical  settings,  the 
construction  of  diffusion  wavelets  carries  over  unchanged  to 
weighted  graphs,  by  considering  the  generator  of  the  diffusion 
associated  to  the  natural  random  walk  (and  Laplacian)  on 
the  graph.  It  then  allows  a  natural  multiscale  analysis  of 
functions  of  interest  on  such  a  graph.  We  expect  this  to  have 
a  wide  range  of  applications  to  the  analysis  of  large  data  sets, 
document corpora, network traffic, etc., which are naturally
modelled  by  graphs. 

4. Extension of empirical functions off the data set

An important aspect of the multiscale analysis developed so far involves the relation of the spectral theory on the set to the localization on and off the set of the corresponding eigenfunctions and diffusion scaling functions and wavelets. In addition to the theoretical interest of this topic, the extension of functions defined on a set $X$ to a larger set $\bar X$ is of critical importance in applications such as statistical learning. To this end, we construct a set of functions, termed geometric harmonics, that allow one to extend a function $f$ off the set $X$, and we explain how this provides a multiscale analysis of $f$. For a more detailed study of geometric harmonics, the reader is referred to [22].

4.1. Construction of the extension: the geometric harmonics. Let us specify the mathematical setting. Let $X$ be a set contained in a larger set $\bar X$, and $\mu$ be a measure on $X$. Suppose that one is given a positive semi-definite symmetric kernel $k(\cdot,\cdot)$ defined on $\bar X \times \bar X$, and if $f$ is defined on $X$, let $K : L^2(X,\mu) \to L^2(X,\mu)$ be defined by

$$Kf(x) = \int_X k(x,y)\, f(y)\, d\mu(y).$$

Let $\{\psi_j\}$ and $\{\lambda_j^2\}$ be the eigenfunctions and eigenvalues of this operator. Note that under weak hypotheses, the operator $K$ is compact, and its eigenfunctions form a basis of $L^2(X,\mu)$. Then by definition, if $\lambda_j^2 > 0$, then

$$\psi_j(x) = \frac{1}{\lambda_j^2} \int_X k(x,y)\, \psi_j(y)\, d\mu(y),$$

where this identity holds for $x \in X$. Now if we let $x$ be in $\bar X$, the right-hand side of this equation is well-defined, and this allows one to extend $\psi_j$ as a function $\bar\psi_j$ defined on $\bar X$. This procedure, which goes by the name of Nyström extension, has already been suggested to overcome the problem of large-scale data sets [23] and to speed up data processing [24].

From  the  above,  each  extension  is  constructed  as  an  inte¬ 
gral  of  the  values  over  the  smaller  set  X,  and  consequently 
verifies  some  sort  of  mean  value  theorem.  We  call  these  func¬ 
tions  geometric  harmonics. 

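From the numerical analysis point of view, one has to be careful, as $\lambda_j \to 0$ as $j \to +\infty$, and one can extend only the eigenfunctions $\psi_j$ for which $\lambda_j^2 > \delta\lambda_0^2$, where $\delta > 0$ is a preset number. We can now safely define the extension $\bar f$ of a function $f$ from $X$ to $\bar X$ by

$$\bar f(x) = \sum_{j\,:\,\lambda_j^2 > \delta\lambda_0^2} \langle f, \psi_j\rangle_X\, \bar\psi_j(x)$$

for $x \in \bar X$, where $\langle\cdot,\cdot\rangle_X$ is the inner product of $L^2(X,\mu)$ and $\bar\psi_j$ is the extension of $\psi_j$ to $\bar X$. This way, the extension operation has condition number $1/\delta$. We immediately notice that for $\bar f$ to approximately coincide with $f$ on $X$, one must have that most of the energy of $f$ be concentrated in the first few eigenfunctions $\psi_j$.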
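The extension procedure can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions rather than the implementation of [22]: it uses a Gaussian kernel, a uniform empirical measure on the samples, and hypothetical names (`gaussian_kernel`, `geometric_harmonics_extend`, `scale`, `delta`); the array `lam` plays the role of the eigenvalues $\lambda_j^2$.

```python
import numpy as np

def gaussian_kernel(P, Q, scale):
    """Matrix of exp(-|p - q|^2 / scale^2) for rows p of P and q of Q."""
    d2 = ((P[:, None, :] - Q[None, :, :]) ** 2).sum(axis=2)
    return np.exp(-d2 / scale**2)

def geometric_harmonics_extend(X, f, X_new, scale, delta=1e-6):
    """Extend f from the sample X to X_new, using only the eigenfunctions
    whose eigenvalue exceeds delta times the largest eigenvalue."""
    w = 1.0 / len(X)                                  # uniform weights standing in for mu
    lam, psi = np.linalg.eigh(w * gaussian_kernel(X, X, scale))
    keep = lam > delta * lam.max()
    lam, psi = lam[keep], psi[:, keep] / np.sqrt(w)   # unit norm in L2(X, mu)
    coeffs = w * psi.T @ f                            # inner products <f, psi_j>_X
    psi_ext = w * gaussian_kernel(X_new, X, scale) @ psi / lam   # Nystrom formula
    return psi_ext @ coeffs

# Hypothetical example: extend cos(theta) from 200 samples of the unit circle.
theta = 2 * np.pi * np.arange(200) / 200
X = np.column_stack([np.cos(theta), np.sin(theta)])
f = np.cos(theta)
f_on_X = geometric_harmonics_extend(X, f, X, scale=0.5)

theta_new = theta + 0.01                              # new points between the samples
X_new = np.column_stack([np.cos(theta_new), np.sin(theta_new)])
f_new = geometric_harmonics_extend(X, f, X_new, scale=0.5)
```

Evaluated back on the samples, the extension reproduces the projection of `f` onto the retained eigenfunctions, so it matches `f` only when most of the energy of `f` lies in those eigenfunctions, as noted above.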

Let us give three examples of geometric harmonics. The first example is related to potential theory. Assume that $X$ is a smooth closed hypersurface of $\mathbb{R}^n = \bar X$, $d\mu = dx$, and consider the Newtonian potential in $\mathbb{R}^n$:

$$k(x,y) = \begin{cases} -\log(\|x - y\|) & \text{if } n = 2, \\ \dfrac{1}{\|x - y\|^{n-2}} & \text{if } n \ge 3. \end{cases}$$

Then the geometric harmonics have the form

$$\bar\psi_j(x) = \frac{1}{\lambda_j^2} \int_X k(x,y)\, \psi_j(y)\, dy,$$

and are obviously harmonic in the domain with boundary $X$. If $f$ is a function on $X$ representing the single layer density of charges on $X$, then the extension $\bar f$ is, by construction, a sum of harmonic functions, and is a harmonic extension of $f$.

For the second example, consider a Hilbert basis $\{e_j\}_{j\in\mathbb{Z}}$ of a subspace $V$ of $L^2(\mathbb{R}^n) \cap C(\mathbb{R}^n)$. For instance, this could be a wavelet basis of some finite-scale shift-invariant space. Then the diagonalization of the restriction of the kernel

$$k(x,y) = \sum_{j\in\mathbb{Z}} e_j(x)\, e_j^*(y)$$

to a set $X$ generates geometric harmonics, and an extension procedure of empirical functions on $X$ to functions of $V$.

The third example is of particular importance as it generalizes the Prolate Spheroidal Wave Functions introduced in the context of signal processing by [25, 26]. Assume that $X \subset \mathbb{R}^n$ and consider the space $V_B$ of bandlimited functions with fixed band $B > 0$ (we call these functions $B$-bandlimited). Following the procedure explained in the second example, we can construct geometric harmonics $\{\psi_j\}$ that are $B$-bandlimited.

It can be shown that this comes down to diagonalizing the kernel

$$k_B(x,y) = \int_{\|\xi\| < B} e^{2i\pi\langle\xi, x\rangle}\, e^{-2i\pi\langle\xi, y\rangle}\, d\xi = B^{n/2}\, \frac{J_{n/2}(2\pi B \|x - y\|)}{\|x - y\|^{n/2}},$$

where $x$ and $y$ belong to $X$, and $J_\nu$ is the Bessel function of the first kind of order $\nu$. From the first equality sign, we see that the geometric harmonics arise from a Principal Component Analysis of the set of all restrictions of $B$-bandlimited complex exponentials to $X$.
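For $n = 1$ the kernel can be computed by hand, which gives a quick check on the Bessel-function form. The worked case below assumes the Fourier convention $e^{2i\pi\langle\xi,x\rangle}$ and the band $\|\xi\| < B$, and uses the standard identity $J_{1/2}(z) = \sqrt{2/(\pi z)}\,\sin z$.

```latex
\begin{aligned}
k_B(x,y) &= \int_{-B}^{B} e^{2i\pi\xi(x-y)}\,d\xi
          = \frac{\sin\bigl(2\pi B\,(x-y)\bigr)}{\pi\,(x-y)}, \\
B^{1/2}\,\frac{J_{1/2}\bigl(2\pi B\,|x-y|\bigr)}{|x-y|^{1/2}}
         &= B^{1/2}\sqrt{\frac{2}{\pi\cdot 2\pi B\,|x-y|}}\;
            \frac{\sin\bigl(2\pi B\,|x-y|\bigr)}{|x-y|^{1/2}}
          = \frac{\sin\bigl(2\pi B\,|x-y|\bigr)}{\pi\,|x-y|}.
\end{aligned}
```

Both lines give the classical sinc kernel, whose eigenfunctions on an interval are the Prolate Spheroidal Wave Functions.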

It can be verified that, in addition to being orthogonal on the set $X$, these $B$-bandlimited geometric harmonics are also orthogonal over the whole space $\mathbb{R}^n$. Moreover, $\bar\psi_j$ minimizes the Rayleigh quotient

$$\frac{\int_{\mathbb{R}^n} |f(x)|^2\, dx}{\int_X |f(x)|^2\, dx}$$

under the constraint that $f$ be orthogonal to $\{\psi_0, \dots, \psi_{j-1}\}$.

In other words, $\bar\psi_j$ is the $B$-bandlimited extension of $\psi_j$ that has minimal energy on $\mathbb{R}^n$. As a consequence, $\bar f$ is the $B$-bandlimited extension of $f$ that has minimal energy off the set $X$. This type of extension is optimal in the sense that it is the average of all $B$-bandlimited extensions of $f$. It also suggests that this extension satisfies Occam’s razor in that it is the “simplest” among all bandlimited extensions: any other extension is equal to $\bar f$ plus an orthogonal bandlimited function that vanishes on $X$.

4.2. Multiscale extension. For a given function $f$ on $X$, we have constructed a minimal energy $B$-bandlimited extension $\bar f$. In the case when $X$ is a smooth compact submanifold of $\mathbb{R}^n$, we can now relate the spectral theory on the set $X$ to that on $\mathbb{R}^n$.

On the one hand, any bandlimited function of band $B > 0$ restricted to $X$ can be expanded to exponential accuracy in terms of the eigenfunctions of the Laplace-Beltrami operator $\Delta$ with eigenvalues $\nu^2$ not exceeding $CB^2$ for some small constant $C > 0$. On the other hand, it can be shown that every eigenfunction of the Laplace-Beltrami operator satisfying this condition extends as a bandlimited function with band $C'B$. Both of these statements can be proved by observing that eigenfunctions on the manifold are well approximated by restrictions of bandlimited functions.

We conclude that any empirical function $f$ on $X$ can be approximated as a linear combination of eigenfunctions of $\Delta$, and these eigenfunctions can be extended to different distances: if the eigenvalue is $\nu^2$, then the corresponding eigenfunction can be extended as a $\nu$-bandlimited function off the set $X$ to a distance $C\nu^{-1}$. This observation constitutes a formulation of the Heisenberg principle involving the Fourier analysis on and off the set $X$, which states that any empirical function can be extended as a sum of “atoms” whose numerical support in the ambient space is related to their frequency content on the set.

The generalized Heisenberg principle is illustrated in Figure 6, where we show the extension of the functions $f_j(\theta) = \cos(2\pi j\theta)$ for $j = 1, 2, 3$ and 4, from the unit circle to the plane. For each function, we used Gaussian kernels, and the scale was adjusted as the maximum scale that would preserve a given accuracy.


[Figure 6 panel titles: Gaussian extension of cos(θ); Gaussian extension of cos(2θ).]

FIGURE 6. Extension of the functions $f_j(\theta) = \cos(2\pi j\theta)$ for $j = 1, 2, 3$ and 4, from the unit circle to the plane.


5.  Conclusion 

We  have  introduced  a  multiscale  structure  for  the  efficient 
computation  of  large  powers  of  a  diffusion  operator,  and  its 
Green’s  function,  based  on  a  generalization  of  wavelets  to 
the  general  setting  of  discretized  manifolds  and  graphs.  This 
has  application  to  the  numerical  solution  of  partial  differen¬ 
tial  equations,  and  to  the  analysis  of  functions  on  large  data 
sets  and  learning.  We  have  shown  that  a  global  (with  eigen¬ 
functions  of  the  Laplacian)  or  local  (with  diffusion  wavelets) 
analysis  on  a  manifold  embedded  in  Euclidean  space  can  be 
extended  outside  the  manifold  in  a  multiscale  fashion  using 
band-limited  functions. 
Abstract

We  extend  the  concept  of  good  continuation  in  a  uni¬ 
form  fashion  from  boundaries  to  shading,  hue,  and  tex¬ 
ture.  Each  has  the  property  that  local  measurements  yield 
an  orientation,  which  we  explicitly  establish  for  hue  using 
geometric  harmonic  techniques.  Good  continuation  arises 
in  a  geometric  sense,  because  these  orientations  all  vary 
smoothly  in  an  appropriate  sense.  Thus  they  correspond  to 
flows. Taken together they define a layered set of flows, in the sense that the “horizontal” computations within each flow provide global consistency while “vertical” computations across flows enable the identification of shading and shadowing and different types of edges. Evidence is reviewed that primate visual systems enjoy such an organization.

“...space  and  color  are  not  distinct  elements  but, 
rather,  are  interdependent  aspects  of  a  unitary  pro¬ 
cess  of  perceptual  organization.”  Kanizsa  [17] 

1.  Introduction 

Image  segmentation  is  normally  taken  to  be  that  pro¬ 
cess  of  partitioning  the  image  into  a  complete  cover  of  non¬ 
overlapping  regions,  with  the  boundaries  of  these  regions 
related  to  the  (projected)  boundaries  of  objects  in  the  world. 
One  source  of  complexity  in  this  process  is  shadowing,  by 
which  image  intensities  vary  both  as  a  function  of  surface 
orientation  (e.g.,  shading)  and  as  a  function  of  light  sources 
(e.g.,  cast  shadows).  Land’s  retinex  theory  [19]  suggested 
one  way  to  manage  this  complexity,  by  ascribing  abrupt  im¬ 
age  changes  to  material  (or  reflectance)  discontinuities  and 
smooth  gradient  changes  to  lighting.  This  developed  into 
the  intrinsic  image  concept  [30],  which  emphasized  that 
surface  properties,  geometry,  and  lighting  all  map  into  the 
image, and suggested representing them separately as images. Undoing this map clearly involves an inverse problem, which requires a model of some sort. One possibility is to try to learn the context of every possible measurement, a type of pseudoinverse [28]. Here we extend the notion of context in a different way, by considering natural images such as those in Fig. 1. Notice how space, reflectance, and
lighting  conspire  together.  We  seek  to  find  a  representation 
rich  enough  to  support  unwinding  this. 

The  first  requirement  for  such  a  representation  is  that  it 
be  rich  enough  to  capture  the  above  phenomena.  But  un¬ 
like  special  purpose  algorithms  applicable  in  one  situation 
(e.g.,  [16,  13]),  our  second  requirement  is  that  it  be  gen¬ 
eral  purpose.  That  is,  the  information  that  it  makes  explicit 
must  support  computations  for  unraveling  many  such  phe¬ 
nomena. 

We  do  not  yet  have  a  formal  solution  to  this  problem 
that  we  can  prove  is  complete.  Instead,  and  consistent  with 
the  goals  of  this  Workshop,  we  develop  an  argument  based 
on  a  neurobiological  analogy,  several  steps  of  which  have 
been  formalized  and  are  complete.  The  demonstrations  in 
the  final  section  of  this  paper  involve  phenomena  beyond 
the  current  capability  of  any  single  existing  algorithm,  and 
provide  counterexamples  to  many.  Constructively,  however, 
we  submit  that  any  final  solution  will  have  an  intermedi¬ 
ate  representation  at  least  as  rich  as  the  one  we  describe. 
Thus  we  see  the  contribution  of  this  Workshop  submission 
as  consisting  of  (i)  an  enlargement  of  the  framework  for  per¬ 
ceptual  organization  informed  by  (ii)  the  rich  foundation  for 
perceptual  organization  in  primate  visual  systems. 

The core of our argument is that good continuation applies to several key domains: boundaries, intensity (shading), hue, texture, saturation, and so on, all of which enjoy
a  certain  differential  geometric  structure.  It  is  this  struc¬ 
ture  that  relates  to  the  Gestalt  notion  of  good  continuation. 
Computationally  we  propose  a  layered  representation — 
similar  in  spirit  to  intrinsic  images  [30] — but  different  in 

Figure 1. Surfaces, lighting, pigmentation, and atmosphere interact richly to provide a diversity of appearance phenomena in natural images. To simply claim that “apples are red” or “bananas are yellow” or “the sky is blue” amounts to an assumption that physical processes in the world are constant in a way that only artificial examples can really achieve.


that  all  share  the  property  that  they  are  flows  in  a  technical 
sense.  This  is  what  we  meant  by  layered  flows  implied  in 
the  title,  and  computations  across  these  flows  then  reflect 
subtle  lighting,  surface,  and  space  interactions. 

Fig. 1 illustrates this point in several different domains (see also [ ]). Apples are not a single color; rather, fruits
mature  differentially  and  this  is  reflected  in  their  pigmen¬ 
tation.  Attempts  to  remove  these  slow  variations  as  light¬ 
ing  are  one  reason  why  lightness  and  color  constancy  algo¬ 
rithms  have  problems.  Atmospheric  depth  effects  impose  a 
blue  tint  with  distance  because  of  increased  scattering  and 
in  spite  of  surface  reflection  effects.  Mutual  illumination 
and  color  bleeding  mix  everything. 

We  approach  the  lift  of  these  images  into  layered  flows 
in  two  stages,  both  of  which  are  mathematical  but  motivated 
by  biology.  We  concentrate  on  one  flow  (from  the  color 
pathway)  because,  as  will  become  clear  below,  the  others  fit 
naturally  into  our  framework  and  are  more  widely  discussed 
in  the  literature.  Specifically,  we  first  consider  the  question 
of  how  to  represent  color  information  as  a  dimensionality- 
reduction  problem,  which  leads  formally  to  intensity-hue- 
saturation  coordinates  at  each  point.  This  is  important  for 
us,  because  it  suggests  that  there  is  more  to  color  process¬ 
ing  than  simple  detection  tasks  (consider:  locate  a  red  fruit 
among  green  foliage  [/  ])  for  which  the  standard  cone  pig¬ 
ments  are  tuned.  We  next  consider  (hue)  interactions  be¬ 
tween  points  and  adopt  a  technique  previously  used  to  de¬ 
noise  color  patterns  to  articulate  the  flow  of  hue  across  im¬ 
age  coordinates.  The  resultant  computations  are  then  run  on 
the  examples  in  Fig.  1. 


2.  Representation  of  Color  at  a  Point 

Take as data the Munsell patches considered as points in wavelength space. While wavelength space is rather high-dimensional, our strategy is motivated by the observation that colors are not randomly distributed throughout wavelength space, but rather occupy only a small portion of it. One possibility, suggested by the visual photopigments in primates, is that this structured space of colors is 3-dimensional. While this is a classical view of color, many of the classical algorithms have been modified in an ad hoc fashion to take account of non-linearities among colors (e.g., Multi-Dimensional Scaling). For this reason we use a new algorithm ([10, 11]) derived from the geometric harmonics (reviewed below) that can handle inherently non-linear data. It is in the class of spectral methods, and is related to [ ].

2.1.  Geometric  Harmonics 

Let $X = \{x_1, x_2, \dots, x_N\}$ be the set of data points, in this case Munsell patches, with each $x_i \in \mathbb{R}^n$. We seek to find a projection of these data into much lower dimension, under the assumption that they are not randomly distributed throughout $\mathbb{R}^n$ but rather that they lie on (or near) a lower-dimensional manifold embedded in $\mathbb{R}^n$.

The structure of the data is revealed via a symmetric, positivity-preserving, and positive semi-definite kernel $k(x,y)$, which provides a measure of similarity between data points. The result is a graph, with edges between nearby (according to the similarity kernel) data points. (The similarity value can be truncated to 0 for all but very similar points.)

From this we construct a diffusion kernel $a(x,y)$ on the data set using the weighted graph Laplacian normalized as follows:

$$a(x,y) = \frac{k(x,y)}{v(x)} \tag{1}$$

where $v(x) = \sum_{y\in X} k(x,y)$. Note that, although symmetry is lost, we do have $\sum_{y\in X} a(x,y) = 1$, so the kernel $a(x,y)$ can be interpreted as the transition matrix of a Markov chain on the data $X$. The kernel $a^{(m)}(x,y)$ of the $m$th power of this matrix then represents the probability of getting from $x$ to $y$ in $m$ steps.

If we now define the averaging operator for a function $f$ defined on the data:

$$Af(x) = \sum_{y\in X} a(x,y)\, f(y) \tag{2}$$

then $A$ admits a spectral theory. To develop this we symmetrize $a$ by:

$$\tilde a(x,y) = \sqrt{\frac{v(x)}{v(y)}}\; a(x,y) \tag{3}$$

which makes $\tilde a$ symmetric and positive semi-definite (although no longer row-stochastic). The spectral decomposition is then given by $\tilde a(x,y) = \sum_{i\ge 0} \lambda_i^2\, \phi_i(x)\, \phi_i(y)$, with the important consequence

$$\tilde a^{(m)}(x,y) = \sum_{i\ge 0} \lambda_i^{2m}\, \phi_i(x)\, \phi_i(y) \tag{4}$$

where $\lambda_0 = 1$.

Increasing powers of the operator $A$ can be obtained by running the chain through the spectral decomposition. This gives rise to the family of diffusion maps $\{\Phi_m\}_{m\in\mathbb{N}}$ given by

$$\Phi_m(x) = \begin{pmatrix} \lambda_0^m\, \phi_0(x) \\ \lambda_1^m\, \phi_1(x) \\ \vdots \end{pmatrix} \tag{5}$$

Diffusion distances $D_m^2(x,y) = \tilde a^{(m)}(x,x) + \tilde a^{(m)}(y,y) - 2\,\tilde a^{(m)}(x,y)$ within the high-dimensional measurement space then approximate Euclidean distance in the diffusion map space.
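The construction of this subsection can be sketched as follows. The code is an illustration under assumptions, not the algorithm of [10, 11]: it uses a Gaussian similarity kernel, the normalization $v(x) = \sum_y k(x,y)$ with the symmetrization $k(x,y)/\sqrt{v(x)v(y)}$, random test data, and the hypothetical name `diffusion_map`; the array `lam2` holds the eigenvalues $\lambda_i^2$.

```python
import numpy as np

def diffusion_map(X, sigma, m=1):
    """Return the Markov matrix a, its symmetrization, the eigenvalues
    lambda_i^2 (descending) and the diffusion map Phi_m (one row per point)."""
    d2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=2)
    k = np.exp(-d2 / sigma)                    # similarity kernel
    v = k.sum(axis=1)                          # v(x) = sum_y k(x, y)
    a = k / v[:, None]                         # row-stochastic transition matrix
    a_sym = k / np.sqrt(np.outer(v, v))        # sqrt(v(x)/v(y)) * a(x, y)
    lam2, phi = np.linalg.eigh(a_sym)
    order = np.argsort(lam2)[::-1]
    lam2, phi = lam2[order], phi[:, order]
    lam2 = np.clip(lam2, 0.0, None)            # remove round-off negatives
    Phi_m = phi * lam2 ** (m / 2.0)            # column i is lambda_i^m * phi_i
    return a, a_sym, lam2, Phi_m

# Hypothetical data: 30 random points in R^5, diffusion time m = 3.
rng = np.random.default_rng(1)
X = rng.standard_normal((30, 5))
a, a_sym, lam2, Phi = diffusion_map(X, sigma=4.0, m=3)

A_m = np.linalg.matrix_power(a_sym, 3)         # kernel of the m-th power
diag = np.diag(A_m)
D2_kernel = diag[:, None] + diag[None, :] - 2 * A_m
D2_embed = ((Phi[:, None, :] - Phi[None, :, :]) ** 2).sum(axis=2)
```

When all eigenvectors are kept, `D2_kernel` and `D2_embed` agree up to round-off; truncating `Phi` to its leading columns gives the low-dimensional approximation used for visualization.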


2.2.  The  Munsell  Color  Space 

The  Munsell  [22]  patches  were  chosen  according  to  hu¬ 
man  psychophysics,  with  each  step  between  patches  per¬ 
ceptually  equal,  and  they  are  now  known  to  be  physiologi¬ 
cally  relevant  [31,  29,  1:  ].  Thus  they  represent  data  span¬ 
ning  those  portions  of  color  space  relevant  to  our  interac¬ 
tions  with  the  visible  world.  We  now  seek  to  understand 
whether these data lie on or near a well-defined structure in wavelength space.

Two  experiments  were  performed.  We  used  N  =  1269 
patches, each with $n = 421$ wavelengths (380 nm to 800 nm in 1 nm steps). The kernel is $\exp(-d_{ij}^2/\sigma)$ where $d_{ij}$ is the Euclidean distance between patch $i$ and patch $j$. While the
patch  data  are  given  in  no  particular  order,  the  geometric 
harmonic  map  arranges  them  so  that  patches  are  close  to 
one  another  provided  the  diffusion  distance  between  them 
in  wavelength  space  is  small.  The  results  are  shown  in 
Fig.  2.  Note  that  the  natural  representation  emerges — 
intensity,  hue,  saturation — even  though  the  hue  (color  cir¬ 
cle)  is  non-linear.  The  diffusion  maps  recover  the  Munsell 
representation,  thus  demonstrating  that  the  structure  is  in 
the  wavelength  data.  In  the  second  experiment  we  first  pro¬ 
jected  the  wavelength  data  through  the  human  cone  pho¬ 
topigments;  and  again  the  color  circle  emerged  (Fig.  2,  bot¬ 
tom). 


3.  Spatio-spectral  Interactions 

Now  that  we  know  there  is  a  preferred  representation  for 
color  at  a  point,  we  next  consider  the  question  of  how  col¬ 
ors  interact  between  nearby  points.  We  first  observe  that 
the  primate  visual  system  is  well  organized  to  address  this 

Figure 2. Geometric harmonics organize Munsell color patches. (top row, left) Typical “page” of the patch data used in the experiment. Data from http://spectral.joensuu.fi/databases/download/munsell_spec_matt.htm. (right) Classical intensity, hue, saturation color space. Note that hue is organized around the circle. (middle row) The geometric harmonic organization of the Munsell data. Each point represents a single patch, and the scatterplots show the distribution of points in the subspace spanned by the first three non-trivial eigenfunctions. Two views are shown, with (left) illustrating different clusters according to the Munsell chromaticity parameters and (right) a view showing the hue circle. That this non-linear organization of the data is recovered by geometric harmonics is significant because it provides the foundation for the next, geometric stage of processing. (bottom row) Organization of the Munsell data first projected through the three human cone photopigments. Since the two views are essentially the same as (middle), the Munsell representation is largely invariant to the order of projection.


problem. While it is widely accepted that perceptual organization is first accomplished via the long-range horizontal connections in superficial V1, consideration of these connections has been limited to orientation good continuation for boundaries ([24, 1, 2]) and textures ([ ]). However, there exists a specialized structure for color (and contrast) information in the cytochrome oxidase blobs, within which neurons also enjoy long-range horizontal interactions (Fig. 3 [2 ]). We submit that it is precisely these connections that implement a geometry for hue (and color) that is formally analogous to that for texture [7] and shading [9, 2 ] flows. A sketch of this geometry is developed next. The extension to include boundaries is in [5].

Figure 3. The cytochrome oxidase blobs in superficial primate visual cortex are specialized for the processing of color. The (left) figure shows the blobs selectively stained to highlight their locations regularly interspersed between orientation hypercolumns. (right) When single cells are filled with dye, their long-range connections become clear. Note how axons tend to terminate within (or near) other cytochrome oxidase blobs (drawn in outline). We submit that it is these long-range connections that enforce “good continuation” between hues at nearby positions. Images courtesy of E. Callaway, Salk Institute.

3.1.  Geometry  of  Hue  Fields 

Within the (intensity, hue, saturation) color space, the hue component across the image is a mapping $H : \mathbb{R}^2 \to S^1$ and thus can be represented as a unit length vector field over the image. In many images this hue field is piecewise smooth (Fig. 4) with singularities corresponding to significant scene events (e.g., occlusion boundaries or material changes).

The frame field [2; ] obtained by attaching a (tangent, normal) frame $\{E_T, E_N\}$ to each point in the image domain is the representation suggested by modern differential geometry. This provides a local coordinate system in which the hue vector and related structures can be represented. Most important among these are the covariant derivatives of $E_T$ and $E_N$, which represent the initial rate of change of the frame when it is moved in a direction $v$, expressed by the connection equation [23]:

relate to each other based on these curvatures. Put differently,
measuring a particular curvature pair (κ_T(q), κ_N(q))
at a point q should induce a field of coherent measurements,
i.e., a hue function HUE(x, y), in the neighborhood of q.
Coherence of HUE(q) with its spatial context can then be
determined by examining how well the measured hue values
fit the model HUE(x, y) around q. Clearly, this model should
be a function of the local hue curvatures (κ_T(q), κ_N(q)); it
should agree with these curvatures at q, and it should extend
around q according to some variation in both curvatures.

While many local coherence models HUE(x, y) are
possible, we exploit the fact that the hue field is a unit length
vector field, which suggests that it behaves similarly to oriented
texture flows [6, 7], and adopt a similar curvature-tuned
local model.


$$\mathrm{HUE}(x, y) = \tan^{-1}\left(\frac{\kappa_T(q)\,x + \kappa_N(q)\,y}{1 + \kappa_N(q)\,x - \kappa_T(q)\,y}\right) \tag{8}$$


Unlike texture flows, however, the local model for the hue
function is not a double helicoid since the hue function
takes values in $[-\pi, \pi)$ where texture flows are constrained
to $[-\frac{\pi}{2}, \frac{\pi}{2})$.
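
As an illustration of the local model in Eq. (8), the following is a minimal sketch that evaluates it on a small grid around q, placed at the origin with HUE(q) = 0. The function name `hue_model` and the use of the four-quadrant arctangent (so that values cover the full hue circle rather than half of it) are assumptions made for this example, not details taken from the original implementation.

```python
import numpy as np

def hue_model(x, y, kappa_t, kappa_n):
    """Evaluate the local hue model of Eq. (8) around q = (0, 0).

    kappa_t, kappa_n are the tangential and normal hue curvatures at q.
    Assumption: arctan2 is used so the result lies on the full circle.
    """
    num = kappa_t * x + kappa_n * y
    den = 1.0 + kappa_n * x - kappa_t * y
    return np.arctan2(num, den)

# Sample the model on a 5x5 neighborhood for a purely normal curvature.
xs, ys = np.meshgrid(np.linspace(-1, 1, 5), np.linspace(-1, 1, 5))
field = hue_model(xs, ys, kappa_t=0.0, kappa_n=0.5)
```

With both curvatures set to zero the model reduces to a constant hue, which corresponds to the zero-curvature continuation case.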

This  local  model  possesses  many  properties  that  suit 
good  continuation;  in  particular  it  is  both  a  minimal  surface 
in the (x, y, HUE(x, y)) representation and a critical point
of  the  p-harmonic  energy  for  all  p.  It  is  also  the  only  local 
model  that  does  not  bias  the  changes  in  one  hue  curvature 
relative  to  the  other,  i.e.,  it  satisfies 


$$\frac{\kappa_T(x, y)}{\kappa_N(x, y)} = \mathrm{const} = \frac{\kappa_T(q)}{\kappa_N(q)}$$


Examples of the model for different curvature tunings are
illustrated in Fig. 5. A detailed technical account of the model
in the texture flow domain can be found in [7].


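$$\begin{pmatrix} \nabla_V E_T \\ \nabla_V E_N \end{pmatrix} = \begin{bmatrix} 0 & w_{12}(V) \\ -w_{12}(V) & 0 \end{bmatrix} \begin{pmatrix} E_T \\ E_N \end{pmatrix}$$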

The coefficient w_12(V) is a function of the tangent vector
V, which represents the fact that the local behavior of the
flow depends on the direction along which it is measured.
w_12(V) is a linear 1-form, so it can be represented with two
scalars at each point:


$$\kappa_T = w_{12}(E_T), \qquad \kappa_N = w_{12}(E_N).$$

We call κ_T the hue’s tangential curvature and κ_N the hue’s
normal curvature; they represent the rate of change of the
hue in the tangential and normal directions, respectively.
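
To make the two curvatures concrete, here is a rough sketch that estimates them from a sampled hue-angle field by finite differences. It assumes, as in the texture-flow setting, that the tangent E_T is the unit hue vector (cos θ, sin θ) and E_N is its 90-degree rotation, so that w_12(V) is the directional derivative of the angle θ along V. The function name `hue_curvatures` and the choice to differentiate cos θ and sin θ rather than θ itself (to avoid wrap-around jumps) are assumptions for this example.

```python
import numpy as np

def hue_curvatures(theta, spacing=1.0):
    """Estimate (kappa_T, kappa_N) from a 2-D array of hue angles in radians.

    theta is indexed as [row = y, column = x]; it is assumed to be smooth.
    """
    c, s = np.cos(theta), np.sin(theta)
    dc_dy, dc_dx = np.gradient(c, spacing)   # derivatives along y, then x
    ds_dy, ds_dx = np.gradient(s, spacing)
    # d(theta) = cos(theta) d(sin theta) - sin(theta) d(cos theta)
    theta_x = c * ds_dx - s * dc_dx
    theta_y = c * ds_dy - s * dc_dy
    kappa_t = theta_x * c + theta_y * s      # change along E_T
    kappa_n = -theta_x * s + theta_y * c     # change along E_N
    return kappa_t, kappa_n
```

For a hue angle that increases linearly in x, for instance, the estimate in the interior of the grid is close to the slope times cos θ for the tangential curvature and minus the slope times sin θ for the normal curvature.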

Since  the  local  behavior  of  the  hue  is  characterized  (up  to 
Euclidean  transformation)  by  a  pair  of  curvatures,  it  is  nat¬ 
ural  to  conclude  that  nearby  measurements  of  hue  should 


4.  Examples  of  Flows 

We  now  illustrate  the  above  computations  on  several  ex¬ 
amples.  We  begin  with  artificial  ones,  to  illustrate  the  points 
most  clearly,  then  proceed  to  natural  ones  to  illustrate  the 
complexities  that  arise. 

We  stress  that,  for  space  reasons,  some  of  these  flows  are 
not  visible  unless  one  zooms  in  to  enlarge  the  manuscript. 

In the first (Fig. 6), we show one of the few examples from
the psychophysical literature. In an important paper, Kingdom
[18] created images consisting of superimposed sinusoids,
one in brightness and the other in color. He demonstrated
that it is the intensity component that drives the impression
of shape-from-shading, while the color information
appears “painted” onto the undulating surface. We reproduced
this separation with our flows, from which it follows
that the shading flow is sufficient (for these examples)
to derive the shape.

Figure 4. Color images of natural objects are piecewise smooth and the hue flow captures this. (A) An apple with varying hue. (B) A
representation of hue as a scalar field, with value corresponding to height. (C) The hue field, with each value represented as a vector pointing
to a location on the hue circle. (D) The geometry of the hue flow, illustrating that nearby values can be represented as a differentiable frame
field that is tangent (and normal) to the streamlines of the flow. Interactions between nearby hue values then correspond to an (infinitesimal)
transport of the frame in direction V, which rotates it according to the connection form of the frame field. Since E_T, E_N are unit length,
their covariant derivative lies in a normal direction, regardless of V. This diagram also suggests a relationship between hue and texture and
shading flows.


Figure 5. Illustration of the different types of compatibility fields that can be used for early forms of good continuation. In each case
the central unit is supported by the contextual arrangement of surrounding units, and can be used as the constraints within quadratic
programming, relaxation labeling, and belief propagation engines. (top) For boundary continuation, the orientation at a position is enhanced
by consistent tangential (co-circular) boundary measurements at nearby positions [ :4, 14]. (middle) For oriented texture measurements,
both tangential and normal curvatures arise. Similar models can be used for shading flows, which are the tangent fields to the intensity
level sets [8]. (bottom) For hue flows the orientations are replaced by colors. In the first column zero curvature continuations are shown.
In the last column, a single large curvature is shown. For the texture and hue compatibilities, the tangential curvature is zero and the normal
curvature is not. Note the emergence of singularities.


The  shading  flow  is  estimated  by  evaluating  a  gradient 
operator  (an  orientationally-selective  receptive  field  tuned 
to  low  spatial  frequency)  over  the  image.  It  demonstrates 
one  role  for  the  long-range  interactions:  correcting  local  ar¬ 
tifacts  in  shading  flow  estimation. 

Our next examples (Fig. 7) on artificial images confirm
the classical view that color remains invariant across
shadows while shading affects surface percepts [25]. This
is most clear in the plastic sphere, and the same effect
is reproduced in the Google logo, which appears both
3-dimensional and colored. However, unlike the plastic
sphere, there are no mutual illumination effects.

The  next  examples  show  how  hue  can  vary  over  a  natural 
object.  Fig.  4  shows  the  hue  flow  for  an  apple,  and  Fig.  8 


is  a  close-up  of  a  woman’s  face  in  which  a  blush  has  been 
introduced.  Note  in  particular  how  variant  the  “color”  is,  a 
point  of  some  relevance  to  both  face  identification  and  emo¬ 
tional  estimation.  Hue  can  also  vary  systematically  over  a 
scene.  Atmospheric  depth  scattering  is  shown  in  Fig.  9. 

Our  next  two  examples  illustrate  the  beautiful  complex¬ 
ity  of  shading,  hue,  and  boundary  interactions.  The  first 
shows  an  apple  photographed  on  a  highly  reflective  surface 
in  bright  sunlight  (Fig.  10).  The  flows  are  varied  with  re¬ 
spect  to  one  another  and  with  respect  to  the  boundaries  (of 
both  the  apple  and  the  shadow).  In  particular,  the  mutual  il¬ 
lumination  modulating  the  shadow  [20]  introduces  a  smooth 
shading  flow  not  unlike  the  one  for  the  plastic  sphere  or  the 
Kingdom  examples  but  this  time  due  to  a  lighting  effect  and 

Figure 6. Results on the test Kingdom images. Note how both
provide the impression of an undulating surface with color on it.
The left column is Kingdom Fig. 2d; the right column is Kingdom
Fig. 2c. From top to bottom are original images; initial estimate
of shading flow (tangents to intensity level sets); final estimate of
shading flow; initial estimate of hue flow; final estimate of hue
flow. The shading flow corresponds to the undulations; the hue
flows are smooth and do not interfere with them.


not  a  surface  normal  effect.  The  mutual  illumination  effect 
is  also  strong  on  the  bananas  image  (Fig.  11),  which  also 
illustrates  a  shading  flow  effect  due  to  a  highly  diffuse  cast 
shadow.  In  this  case  the  cast  shadow  phenomenon  is  readily 
identified,  because  the  hue  flow  is  constant  across  it. 

Our  final  example  (Fig.  12)  illustrates  the  complement 
to  shading  and  hue;  notice  how  the  hue  remains  invariant 
through  the  highlight,  even  though  it  is  a  complex  pattern 
for  the  pepper. 

5.  Summary  and  Conclusions 

Perceptual  organization  was  viewed  within  Gestalt  psy¬ 
chology  as  pervasive  in  perception,  but  discussion  of  such 
issues  in  computer  vision  is  significantly  more  limited.  Our 
goal  in  this  paper  was  to  take  a  step  back  and  raise  the  pro¬ 
file  of  questions  for  which  P.O.  is  relevant.  Following  a 
biological  analogy,  we  introduced  the  construct  of  multiple 
(spatially)  aligned  flows  within  which  Gestalt  good  contin¬ 
uation  can  be  enforced  geometrically  but  between  which 
information  can  be  inferred  about  the  many  complexities 
of  lighting,  space,  and  geometry.  The  computation  of  each 


Figure 7. Shading and hue flows for artificial objects. Although
the shading flow fields are not shown, notice how the hue flows
(superimposed on the original image) are constant over the “plastic”
objects. This is the way such materials were designed. The
case of the sphere also introduces two more complex lighting effects.
First, note how the hue flow remains constant through the
shadow. This is a classical cue for separating shadow boundaries
from surface boundaries. (Surface boundaries are taken to involve
different materials, and therefore a hue discontinuity together with
the intensity discontinuity.) Second, and less familiar, is the mutual
illumination between the sphere and the tabletop, which is
captured by the hue flow but not the shading flow. The left
magnification shows the initial local measurements of hue; the right
magnification shows the converged hue flow. A boundary has been
introduced around the hue flow on the table top illustrating an
elongation in the direction of the source.


Figure 8. Hue flows vary for natural objects. This shows a portion
of  a  woman’s  face  (the  lips  are  lower  left)  when  she  is  blushing 
(blue  vectors)  and  not  blushing  (black  vectors).  Note  how  hue 
varies  both  spatially  and  as  a  function  of  emotional  and  physical 
states. 


flow  was  global,  based  on  local  measurements  and  differen¬ 
tial  (covariant  derivative)  constraints  between  them.  At  the 
same  time  the  computation  of  each  flow  was  local  within  an 
information  (sometimes  within  a  sensor)  source,  and  logical 
relationships  between  flows  provide  a  new  foundation  for 

Figure 9. Hue flows and atmospheric depth effects. The flow is
shown along a thin strip on the right side of the photograph. Note
the dominant shift toward blue for the upper half.


Figure 10. An image of an apple on colored cardboard in bright
sunlight. It illustrates the complexities that can arise both for shading
due to surface irregularities from packing and from mutual
illumination. In particular, the shaded area now exhibits a shading
flow derived from mutual illumination, in which the gradient
decreases in magnitude away from the concavity between the apple
and the table. At the same time, there is strong mutual illumination
between the apple and the cardboard and the cardboard and
the apple. The results are smooth shading and hue flows, with
discontinuities at neither object nor shadow edges.

many  computer  vision  computations.  Hue  flows  smoothly 
through  shadows,  while  intensity  often  jumps.  Shading 
flows  smoothly  over  many  man-made  objects,  while  hue  is 
often  constant.  Natural  objects  often  imply  smooth  shading 
and  hue  flows,  although  they  are  typically  independent  of 
one  another.  The  involvement  of  boundaries  is  both  neces¬ 
sary  and  complicated  [12]. 


Figure 11. A photograph of bananas illustrates the richness of mutual
illumination in a complex scene. The result is an essentially
constant hue flow (middle row, left: initial measurement; right:
consistent flow). The shading flow (bottom) illustrates a special
interaction between boundaries and shading flows, in which multiple
surfaces fold away from each other along them. Such situations
are geometrically rare.

While  the  list  of  interactions  must  be  extended  (motion 
and  stereo  should  at  least  be  included),  it  is  useful  to  con¬ 
clude  on  an  enlargement  of  the  biological  metaphor  under¬ 
lying  this  paper.  The  centrality  of  long-range  horizontal 
connections  as  defining  each  flow  suggests  that  the  flows  be 
layered on top of one another, enabling “vertical” connections
for their interactions. Recent breakthroughs in color
processing  demonstrate  that  hue  and  orientation  are  not  in¬ 
dependent,  as  was  once  thought,  and  that  such  vertical  con¬ 
nections  exist  [26].  Computationally  it  remains  an  open 
question  whether  only  two  interaction  “dimensions”  suffice. 

ABSTRACT

We describe signal processing tools to extract structure and information from arbitrary digital data sets. In
particular, heterogeneous multi-sensor measurements which involve corrupt data, either noisy or with missing
entries, present formidable challenges. We sketch methodologies for using the network of inferences and similarities
between the data points to create robust nonlinear estimators for missing or noisy entries. These methods enable
coherent fusion of data from a multiplicity of sources, generalizing signal processing to a nonlinear setting. Since
they provide empirical data models they could also potentially extend analog to digital conversion schemes like
“sigma delta”.

Keywords:  Markov  processes,  multiscale  analysis,  diffusion  on  manifolds,  Laplace-Beltrami  operator. 


1.  FEATURE  BASED  FILTERING,  DIFFUSIONS  AND  SIGNAL  PROCESSING  ON 

GRAPHS 


A simple way to understand the effect of introducing similarity based diffusions on data1-6 is provided by
considering a regular gray level image in which we associate with each pixel p a vector ν(p) of features,7, 8 for
example, a multi-band electromagnetic spectrum or the 5x5 sub-image centered at the pixel, or any combination
of features. Define a Markov filter


$$A_{p,q} = \frac{\exp\left(-\|\nu(p)-\nu(q)\|^2/\varepsilon\right)}{\sum_{q'} \exp\left(-\|\nu(p)-\nu(q')\|^2/\varepsilon\right)}, \tag{1}$$


where ε > 0 is a small parameter comparable to the smallest distances between two feature vectors ν(p) and
ν(q). Clearly the map ν is a bijection between pixels in the image and patches (or features). In particular every
function on the pixels, such as the original image I itself, is also a function on the set of patches. With this
identification, one can let the Markov filter A_{p,q} act on an image.

The image I in Figure 1 was filtered using the (nonlinear in the features) procedure described above, where
the feature vector ν(p) is the 5x5 patch around a pixel p:


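$$\tilde{I}(p) = \sum_{q} A_{p,q}\, I(q) = \sum_{q} \frac{\exp\left(-\|\nu(p)-\nu(q)\|^2/\varepsilon\right)}{\sum_{q'} \exp\left(-\|\nu(p)-\nu(q')\|^2/\varepsilon\right)}\, I(q) \tag{2}$$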


Observe  that  the  edges  are  well  preserved  as  patches  translated  parallel  to  an  edge  are  similar  and  contribute 
more  to  the  averaging  procedure.7,8  We  should  also  observe  that  if  we  were  to  repeat  the  procedure  on  the 
filtered  image  we  would  get  a  numerical  implementation  of  various  nonlinear  heat  diffusions  for  image  processing 
similar  to  those  in  PDE  methods,  such  as  those  by  Osher  and  Rudin. 
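
The filtering step in (1) and (2) can be sketched as follows. This is a small dense illustration, assuming a gray level image small enough that the full pixel-by-pixel matrix fits in memory; the function names, the reflective padding at the image border, and the default patch radius are choices made for this example rather than details given in the text.

```python
import numpy as np

def patch_features(image, radius=2):
    """Return one feature vector per pixel: the (2r+1)x(2r+1) patch around it."""
    padded = np.pad(image, radius, mode="reflect")   # border handling is an assumption
    h, w = image.shape
    size = 2 * radius + 1
    shifted = [padded[dy:dy + h, dx:dx + w].ravel()
               for dy in range(size) for dx in range(size)]
    return np.stack(shifted, axis=1)                 # shape (h*w, size*size)

def markov_matrix(features, eps):
    """Row-normalized Gaussian kernel on the feature vectors, as in Eq. (1)."""
    sq = np.sum(features ** 2, axis=1)
    d2 = np.maximum(sq[:, None] + sq[None, :] - 2.0 * features @ features.T, 0.0)
    kernel = np.exp(-d2 / eps)
    return kernel / kernel.sum(axis=1, keepdims=True)

def markov_filter(image, eps, radius=2):
    """Apply the Markov matrix to the image viewed as a function on patches, Eq. (2)."""
    A = markov_matrix(patch_features(image, radius), eps)
    return (A @ image.ravel()).reshape(image.shape)
```

Because each output pixel is a convex combination of input pixels, the filtered values stay within the range of the original image; calling `markov_filter` repeatedly corresponds to the iterated procedure mentioned above.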

It  is  useful  to  replace  A  by  a  bi-Markovian  version  of  the  form 

$$A_{p,q} = \frac{\exp\left(-\|\nu(p)-\nu(q)\|^2/\varepsilon\right)}{\omega(p)\,\omega(q)}$$

Figure 1. Left: original noisy image. Right: image denoised by application of the Markov matrix as in (1).


Figure  2.  Left:  original  noisy  image.  Right:  image  denoised  by  application  of  the  Markov  matrix  as  in  (1),  but  where 
features  are  local  variances  rather  than  pixel  values  in  a  patch  around  each  pixel. 

where the weights ω(·) are selected so that A is Markov in p and q.
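
One way to find such weights numerically is a Sinkhorn-type alternating normalization. The sketch below is an assumption about how the weights could be computed, since the text does not specify a procedure; the function name `bi_markov` and the fixed iteration count are hypothetical. It takes a symmetric kernel with strictly positive entries and returns the scaled matrix together with the weights ω.

```python
import numpy as np

def bi_markov(kernel, n_iter=200):
    """Scale a symmetric positive kernel K into A[p, q] = K[p, q] / (omega[p] * omega[q])
    whose rows and columns all sum (approximately) to one."""
    n = kernel.shape[0]
    r = np.ones(n)
    c = np.ones(n)
    for _ in range(n_iter):
        r = 1.0 / (kernel @ c)       # rows of diag(r) K diag(c) sum to one
        c = 1.0 / (kernel.T @ r)     # columns of diag(r) K diag(c) sum to one
    d = np.sqrt(r * c)               # symmetric scaling; valid because K is symmetric
    omega = 1.0 / d
    return kernel / np.outer(omega, omega), omega
```

The resulting matrix is symmetric, so its eigenvectors are orthogonal, which is convenient when computing powers of the diffusion.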

The  noisy  IR  image  in  Figure  2  was  filtered  by  N.  Coult  using  a  vector  of  25  statistical  features  associated 
with  each  pixel. 

The Markov matrix used for filtering defines a diffusion on the graph of patches or features viewed as a subset
of 25 dimensional Euclidean space. The eigenvectors of this diffusion permit us to compute all of its powers and
to define a diffusion geometry and signal processing on this “image graph”.7

For the next example consider 3 noisy sensors measuring the x, y, z coordinates of a trajectory in three
dimensions. We could try to denoise each coordinate separately, or use the position vector as a feature vector
as we did for the images above. See Figure 1.

The construction above should be viewed as signal processing on the data graph. We view all points of the
trajectory as a data graph, i.e., data points p and q are vertices and A_{p,q} is the weight of the edge connecting them

Figure 4. Left: standard position of electrodes in EEG. Middle: diffusion map of the responses to 4 electrodes, showing
the nonlinear correlations and manifold-like structure of these responses. Right: diffusion map of the responses to all
electrodes, exhibiting similar nonlinear correlations. In fact, the manifold structure obtained from measuring from all
electrodes is very close to that obtained from 4 electrodes, suggesting that exploiting the nonlinear correlations would
allow using only 4 electrodes.


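and define the diffusion map at time t into m dimensional Euclidean space by

$$\Phi_t^{(m)}(X_p) := \left(\lambda_1^{t/2}\varphi_1(X_p),\ \lambda_2^{t/2}\varphi_2(X_p),\ \ldots,\ \lambda_m^{t/2}\varphi_m(X_p)\right) \tag{3}$$

For a given t we determine m so that $\lambda_{m+1}^{t}$ is negligible. The diffusion distance1 at time t between X_p and
X_q is given as

$$d_t^2(p,q) = A^t_{p,p} + A^t_{q,q} - 2A^t_{p,q} = \sum_i \lambda_i^t\left(\varphi_i(X_p)-\varphi_i(X_q)\right)^2 = \|\Phi_t(X_p)-\Phi_t(X_q)\|^2 .$$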
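
The relation between the embedding and the diffusion distance can be checked numerically. The sketch below assumes a symmetric, positive semi-definite diffusion matrix A (for instance a symmetrically normalized Gaussian kernel), so that A^t has the spectral expansion used above; the function names and the scaling of the i-th coordinate by the eigenvalue raised to t/2 are assumptions chosen so that squared Euclidean distances in the embedding reproduce the diffusion distance.

```python
import numpy as np

def diffusion_map(A, t, m):
    """Embed each data point using the top m eigenpairs of a symmetric matrix A."""
    lam, phi = np.linalg.eigh(A)             # eigenvalues in ascending order
    order = np.argsort(lam)[::-1]            # reorder to descending
    lam = np.clip(lam[order], 0.0, None)     # guard against tiny negative round-off
    phi = phi[:, order]
    return phi[:, :m] * lam[:m] ** (t / 2.0)

def diffusion_distance_sq(A, t, p, q):
    """Squared diffusion distance computed directly from the entries of A^t."""
    At = np.linalg.matrix_power(A, t)
    return At[p, p] + At[q, q] - 2.0 * At[p, q]
```

When all eigenvectors are kept, the squared Euclidean distance between two embedded points agrees with `diffusion_distance_sq` up to round-off; truncating to a smaller m drops only the terms whose eigenvalue powers are negligible.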

This  map  enables  us  to  represent  geometrically  an  abstract  set  of  measurements  on  a  sensor  array  (measure¬ 
ment  space)  as  we  illustrate  on  the  following  EEG  example.11 

The 20 electrodes measure coherent electrical activity in the brain. Mapping the configuration space of
the measurements of 4 electrodes leads to the same configuration as for all 20. In the linear case this will be
obtained by de-correlating the outputs; here, however, different locations of sources result in different attenuation
vectors, or linear de-correlations. Here the first three nontrivial eigenvectors are used to map the data to three
dimensions (diffusion map), see Figure 4. The implications are obvious: 4 electrodes suffice to get essentially the
same measurements, and redundancy is useful to obtain a clean version.11

3.  MULTISCALE  STRUCTURES  AND  THE  EMERGENCE  OF  ABSTRACT 

SENSOR  FEATURES 

It is possible to build a multiscale decomposition of a data graph simply by organizing the data into affinity
folders, where the affinity is measured through the diffusion distance at different time scales. A simple algorithm9
is obtained as follows. Let {x_j^{l+1}} be a maximal sub-collection of points in {x_j^l} (key-points at scale l) such that

$$d_{t_l}\left(x_j^{l+1}, x_i^{l+1}\right) \ge \tfrac{1}{2},$$

where x_j^0 are the original points, and t_l = 2^l, l = 0, 1, 2, .... Then clearly each point is at
distance less than a half at scale l from one of the selected key-points, allowing us to create a folder labeled by
the key-point. It is easy to modify this construction to obtain a tree of disjoint folders by viewing each key-point as the folder of
points nearest to it, and reinterpreting the distance as a distance between folders.
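
A single level of this construction can be sketched with a greedy pass over the candidate points. The function names and the greedy visiting order are assumptions for this example; `dist` stands for a precomputed matrix of diffusion distances at the current time scale.

```python
import numpy as np

def select_keypoints(dist, candidates, threshold=0.5):
    """Greedily pick a maximal subset of candidates that are pairwise
    at distance >= threshold from one another."""
    keys = []
    for j in candidates:
        if all(dist[j, k] >= threshold for k in keys):
            keys.append(j)
    return keys

def assign_folders(dist, keys):
    """Label every point with its nearest key-point, giving disjoint folders."""
    keys = np.asarray(keys)
    return keys[np.argmin(dist[:, keys], axis=1)]

# Toy usage: plain distances on a line stand in for diffusion distances.
x = np.array([0.0, 0.1, 0.2, 1.0, 1.1, 2.0])
dist = np.abs(x[:, None] - x[None, :])
keys = select_keypoints(dist, range(len(x)))
folders = assign_folders(dist, keys)
```

To build the next level, the same selection would be repeated with the key-points just found as candidates and with distances computed at the next time scale.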

When  applied  to  text  documents  (equipped  with  semantic  coordinates),  this  construction  builds  an  automatic 
folder structure with corresponding keywords characterizing the folders.4, 7 While for text documents folders
are just collections of related documents, and abstractions are collections of words in a given class, the situation is

where φ_α(q) is a (e.g. wavelet) basis on Q, and φ_β(r) is a (wavelet) basis on R. In the formula above


$$\delta_{\alpha,\beta} = \sum_{q,r} d(q,r)\,\varphi_\alpha(q)\,\varphi_\beta(r),$$


where we accept this sum (as validated) only if various randomized averages using subsamples of our data lead
to the same value of δ_{α,β}. In the calculation of D we only use accepted estimates for δ_{α,β}.

The  wavelet  basis  can  of  course  be  replaced  by  tensor  products  of  scaling  functions  or  any  other  approximation 
method  in  the  tensor  product  space,  including  other  pairs  of  bases,  one  for  q  the  other  for  r,  including  graph 
Laplacian eigenfunctions (we observe in passing that the singular value decomposition is a particular case of this
construction). A direct method for filtering d or estimating D without the need to build basis functions can be
implemented  as  at  the  beginning  of  this  paper. 

Define a Markov matrix A = a[(r,q),(r',q')] (corresponding to diffusion on Q x R) as


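$$a[(r,q),(r',q')] = \frac{\exp\left(-\left(\frac{\|\nu(r)-\nu(r')\|^2}{\varepsilon} + \frac{\|\mu(q)-\mu(q')\|^2}{\delta}\right)\right)}{\sum_{r'',q''} \exp\left(-\left(\frac{\|\nu(r)-\nu(r'')\|^2}{\varepsilon} + \frac{\|\mu(q)-\mu(q'')\|^2}{\delta}\right)\right)} \tag{5}$$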


where the vector ν(r) is the response column vector corresponding to the column r, and μ(q) is a sensor row vector.

The  parameters  epsilon,  delta  are  chosen  after  randomized  validation  as  described  above.  We  can  have  an 
alternate  definition  of  D  as  follows. 


$$D(r, q) = \sum_{r',q'} a[(r,q),(r',q')]\; d(q', r').$$


Observe  that  the  distances  occurring  in  the  exponent  can  be  replaced  by  any  convenient  notion  of  distance  or 
dissimilarities,  and  that  any  polynomial  in  A  can  be  used  to  obtain  a  better  filtering  operation  on  the  raw  data. 
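
Because the exponent in (5) is a sum of a term in r and a term in q, the kernel factors into a product and the filter can be applied without forming the full matrix on Q x R. The sketch below relies on that observation; it assumes a complete data matrix `d` with rows indexed by q and columns by r, and the function names are hypothetical.

```python
import numpy as np

def row_markov(features, scale):
    """Row-normalized Gaussian kernel between the rows of `features`."""
    sq = np.sum(features ** 2, axis=1)
    d2 = np.maximum(sq[:, None] + sq[None, :] - 2.0 * features @ features.T, 0.0)
    k = np.exp(-d2 / scale)
    return k / k.sum(axis=1, keepdims=True)

def product_filter(d, eps, delta):
    """Filtered estimate D of the data matrix d (rows = q, columns = r).

    The column vectors of d play the role of nu(r) and the row vectors
    the role of mu(q); the separable kernel gives D = A_Q d A_R^T.
    """
    A_R = row_markov(d.T, eps)     # Markov matrix between columns r, r'
    A_Q = row_markov(d, delta)     # Markov matrix between rows q, q'
    return A_Q @ d @ A_R.T
```

Each entry of the result is the weighted average of all entries d(q', r') with weights given by (5), so a constant data matrix is left unchanged.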

A new combined graph can also be formed by embedding the graph Q x R into Euclidean space, say by the
diffusion embedding, followed by an expansion of the data d(q, r) on this new structure, or by filtering as above
on the new structure.

5.1.  Markov  Decision  Processes 

In the papers14, 15 the multiscale analysis construction of diffusion wavelets is applied to Markov Decision Processes.
Informally, and in a simplified version, one or more agents explore a given state space S by taking actions
in each state from a set of actions A, and collect different rewards R, that we assume, to simplify the presentation,
to depend only on the location and not on the action. Suppose we can model the state space as a finite
graph (S, E, W) (the uncountable or continuous case can be handled as well), with edges E and weights W, and
that the agent(s) explore the state space randomly according to the Markov process $P^\pi$, parametrized by a
(policy) $\pi$, which maps each state to a probability distribution of actions for that state. The reward function
R is a real-valued function on S. The expected long term sum of discounted rewards when the agent follows
the policy $\pi$ is a function $V^\pi$ on S, called the (state) value function. It satisfies the so-called Bellman equation
$V^\pi = R + \gamma P^\pi V^\pi$, $\gamma \in (0, 1]$ being the discount factor, and hence $V^\pi = (I - \gamma P^\pi)^{-1} R$. In terms of potential
theory, $(I - P^\pi)^{-1}$ is the Green’s function (or fundamental matrix) of the “Laplacian” $I - P^\pi$, and $V^\pi$ is the
potential generated by the “charge” R under the diffusion $P^\pi$. Suppose for simplicity that $P^\pi$ is reversible: it
is then similar to a symmetric matrix $T^\pi$ that generates a Markov diffusion semigroup $\{(T^\pi)^t\}$. The diffusion
multiscale analysis allows one to efficiently compute $(P^\pi)^t(x, y)$ for arbitrary t, medium and large, for one or multiple
agents; it allows one to effectively approximate the value function $V^\pi$, which is often piecewise smooth, performing
a very useful dimensionality reduction,14 where ad hoc basis functions were previously constructed by hand and
were only available in particularly simple geometries. Finally, it allows one to solve Bellman’s equation directly, to
high precision, in an efficient way. In15 this method is compared with classical direct methods (often unfeasible
because of their computational complexity of $O(|S|^3)$), and with optimized iterative solvers.
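
For reference, the classical direct method mentioned above amounts to a single dense linear solve. The sketch below shows that baseline on a tiny chain, assuming a discount factor strictly less than one so that the system is nonsingular; it is not the diffusion wavelet solver of the cited papers, and the function name is hypothetical.

```python
import numpy as np

def value_function(P, R, gamma):
    """Solve the Bellman equation V = R + gamma * P V for a fixed policy.

    P is the row-stochastic transition matrix under the policy and R the
    reward vector; this dense solve costs on the order of |S|^3 operations.
    """
    n = P.shape[0]
    return np.linalg.solve(np.eye(n) - gamma * P, R)

# Two-state example: state 1 is absorbing and carries no reward.
P = np.array([[0.5, 0.5],
              [0.0, 1.0]])
R = np.array([1.0, 0.0])
V = value_function(P, R, gamma=0.9)
```

The cubic cost of this solve is what makes the direct approach impractical for large state spaces.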


Figure  7.  Left:  continuous  state  space  for  a  MDP,  the  actions  are  movements  in  the  four  cardinal  directions,  blue  points 
represent positive rewards. Right: after a random exploration by the agent, multiscale basis functions are constructed
on  the  state  space:  the  color  is  proportional  to  the  value  of  various  scaling  functions,  which  are  automatically  adapted 
to  the  state  space.  The  value  function  can  be  projected  onto  this  basis,  in  fact  if  the  value  function  is  piecewise  smooth, 
only  few  elements  of  the  basis  (a  number  independent  of  the  number  of  samples!)  will  be  required  to  approximate  the 
value  function  to  a  given  precision. 

6.  CONCLUSIONS  AND  DISCUSSION 

It is quite clear from the preceding descriptions that the data graph can be equipped with informative geometric
structures which coherently integrate data and enable inference and interpolation. One of our main goals is to
efficiently regress empirical functions on a data set; we have indicated various methods to build and approximate
empirical functions, admitting natural extensions (generalization) off the known measured data. We also
indicated that signal processing on data could be achieved without any knowledge of the data model, by letting
the intrinsic data geometry emerge through a natural process of affinity diffusion. Modern sensor systems such
as radar, hyperspectral, MRI and others actually do not measure images but much more elaborate vectors. The
images are built to allow understanding and further processing; in reality we should let the intrinsic geometry of
the measurements participate in the information extraction. Such an approach has been developed by our team
for hyperspectral imaging.

We also observe that, in the context of compressed sensing where the sensor inputs are randomly encoded,
the projection into a random coded subspace, while maintaining the relative affinity of the original data points,
permits rebuilding the data geometry by the tools described above.

Abstract

Belief propagation has been shown to be a powerful inference
mechanism for stereo correspondence. However the
classical formulation of belief propagation implicitly imposes
the frontal parallel plane assumption in the compatibility
matrix for exploiting contextual information, since
the priors prefer no depth (disparity) change in surrounding
neighborhoods. This results in systematic errors for slanted
or curved surfaces. To eliminate these errors we propose
to use contextual information geometrically and show how
to encode surface differential geometric properties in the
compatibility matrix for stereo correspondence. This enforces
consistency for both depth and surface normal, extending
the traditional formulation beyond consistency for
(constant) depth. With such geometric contextual information,
the belief propagation algorithm shows dramatic improvement
on generic non-frontal parallel scenes. Several
such examples are provided.


1.  Introduction 

In recent years both belief propagation [28, 29] and graph
cuts [5, 17, 18] have been proposed to solve the stereo correspondence
problem. They have achieved great success
as their variants keep topping the comparison chart for the
Middlebury dataset [29, 27]. However a closer look at these
image pairs indicates that they are quite limited in terms
of the surface types represented. The ground truth disparity
shows that every object has very limited (or no) disparity
change, indicating every single object can be well
described by a (combination of) frontal parallel plane(s).
Most importantly this limitation has been exploited algorithmically.
When using contextual information, both belief
propagation [23, 29] and graph cuts [5, 17] use either the
Potts energy model or the truncated linear (quadratic) energy
model, which implicitly use the frontal parallel plane
assumption: namely that, within a region under consideration,
position disparity (or depth) is constant with respect to
the rectified stereo image pair.

However the real world consists of objects of much
richer geometry than frontal parallel planes. Can these algorithms
handle slanted or curved surfaces? We show that
current formulations cannot handle such scenes well. Our
goal in this paper is to move beyond those limitations for
the belief propagation algorithm. Others have attempted
more general surface types. Slanted planar surfaces are explicitly
modeled for segmented regions in [3], where segmentation
and correspondence are iteratively obtained from
the multiway-cut algorithm [5]; this has been generalized to
curved surfaces [23]. In [24] a slanted scanline algorithm
is developed for slanted planar surfaces. To our knowledge
there are no other attempts at using surface differential geometry
directly for general, smooth surfaces.


Figure 1. 3D reconstruction of a face from a pair of stereo
images. Our algorithm can achieve accurate geometric modeling of
such smooth curved surfaces. Since surface normals are changing
smoothly almost everywhere, it follows that the tangent planes are
mostly not frontal parallel planes. Therefore it is necessary to
develop richer geometry in computing the compatibilities in belief
propagation than what has been currently used.

Using belief propagation as our algorithmic framework,
we argue that for general surface types differential geometric
properties have to be taken into consideration when
using contextual information. In particular, we derive
the compatibilities using both position disparity (or depth)
and disparity derivatives (surface normal) for general planar
and smooth curved surfaces and illustrate their use on
scenes with slanted surfaces and faces. Comparison results
with the traditional formulation of compatibilities [29, 28]
clearly demonstrate the importance of using both positional
(zeroth-order) disparity (or depth) and surface normal
(first-order disparities). Fig. 1 is a preview of our face
reconstruction example (Fig. 8). Such accurate surface
reconstructions are necessary for applications in 3D modeling,
computer graphics, facial expression recognition, and surgical
planning.

2.  Problem  Formulation 

With a rectified stereo pair [9, 13], stereo correspondence
can be formulated as the estimation of a random variable
$x_k$ for every node $k$ in a Markov Random Field (MRF),
with $x_k$ the (positional) disparity at $k$ (notation follows
[11, 29]). Further assuming a pair-wise MRF, let $\Psi_{ij}(x_i, x_j)$
denote the compatibility function, which encodes the
compatibility between two immediately neighboring nodes $i$ and
$j$, and $\Phi_k(x_k, y_k)$ (also shortened as $\Phi_k(x_k)$) denote local
evidence that variable $x_k$ is consistent with observation $y_k$.
Then the joint probability of this MRF is:

$$P(x_1, x_2, \dots, x_N, y_1, y_2, \dots, y_N) = \prod_{(i,j)} \Psi_{ij}(x_i, x_j) \prod_{k} \Phi_k(x_k, y_k) \tag{1}$$

Different criteria in optimizing eq. (1) give different message
passing rules for belief propagation. In particular, the
Maximum A Posteriori (MAP) estimator gives us the max-product
algorithm. Specifically, messages at iteration $t + 1$
are given by [28, 11]:

$$m_{ij}^{t+1}(x_j) \leftarrow \alpha \max_{x_i} \Psi_{ij}(x_i, x_j)\, \Phi_i(x_i) \prod_{k \in N(i) \setminus j} m_{ki}^{t}(x_i) \tag{2}$$

where $m_{ij}^{t+1}$ is the message that node $i$ sends to node $j$ at
iteration $t + 1$, $N(i) \setminus j$ is the set of nodes neighboring node
$i$ except node $j$ itself, and $\alpha$ is a normalization term.

The belief $b_i$ at node $i$ is then computed as:

$$b_i(x_i) \leftarrow \Phi_i(x_i) \prod_{k \in N(i)} m_{ki}(x_i) \tag{3}$$

The MAP solution at node $i$ is:

$$x_i^{MAP} = \arg\max_{x_i \in \{d_1, \dots, d_L\}} b_i(x_i) \tag{4}$$

Alternatively one can obtain the Minimum Mean
Squared Error (MMSE) estimate [11], which requires changing
$\max_{x_i}$ in eq. (2) to $\sum_{x_i}$ (thus called the sum-product
algorithm), and accordingly computing the solution
as $x_i^{MMSE} = \sum_{x_i} x_i\, b_i(x_i)$ (with $b_i$ normalized to sum to one).

For direct comparison with other belief propagation
stereo algorithms [29, 28], we use the max-product message
updating rule to find the MAP estimate. It has been
shown [29] that this is the same problem that graph cuts
algorithms have been solving [5]. Our focus in this paper is to
develop the compatibilities geometrically; however, efficient
techniques [10] could be used to speed up our implementation.

2.1. Traditional Compatibility $\Psi_{ij}$

The most common models for $\Psi_{ij}(x_i, x_j)$ both prefer no
disparity change between two neighboring nodes.

2.1.1 Potts Energy Model Derived $\Psi_{ij}$

First, the compatibilities in [29] are defined as:

$$\Psi_{ij}^{Potts}(x_i, x_j) = \exp\left(-\frac{V(x_i, x_j)}{\sigma_V}\right)$$

which is derived from the Potts energy model with

$$V(x_i, x_j) = \begin{cases} 0 & \text{if } x_i = x_j \\ \rho_I(|I_i - I_j|) & \text{otherwise} \end{cases}$$

where

$$\rho_I(|I_i - I_j|) = \begin{cases} C_1 \times \lambda_1 & \text{if } |I_i - I_j| < T \\ \lambda_1 & \text{otherwise} \end{cases} \tag{5}$$

with $T$ a threshold, $\lambda_1$ a penalty term for having different
disparities, and $C_1$ a penalty term that increases the penalty
when the image intensity difference is small. $T$, $\lambda_1$, and
$C_1$ are constant parameters. The Potts model assumes that
labelings should be piecewise constant.
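The message update of eq. (2), the belief of eq. (3), and the MAP choice of eq. (4) can be sketched on a two-node example. This is a minimal illustration, not the authors' implementation: the function names, the evidence vectors, and the use of a single intensity-independent Potts penalty are assumptions made only for the sketch.

```python
import math

def potts_compatibility(num_labels, penalty, sigma_v):
    """Psi[a][b] = exp(-V / sigma_v) with V = 0 if a == b, else `penalty`.
    (Illustrative: the intensity-dependent choice between C1*lambda1 and
    lambda1 in eq. (5) is collapsed into one hypothetical penalty value.)"""
    off = math.exp(-penalty / sigma_v)
    return [[1.0 if a == b else off for b in range(num_labels)]
            for a in range(num_labels)]

def max_product_message(psi, phi_i, incoming):
    """Eq. (2): message from node i to node j, normalized to sum to one.
    psi[a][b] : compatibility of label a at i with label b at j
    phi_i[a]  : local evidence at node i
    incoming  : messages m_ki for k in N(i) except j, one list per neighbor"""
    support = list(phi_i)
    for m in incoming:
        support = [s * v for s, v in zip(support, m)]
    num_j = len(psi[0])
    msg = [max(psi[a][b] * support[a] for a in range(len(support)))
           for b in range(num_j)]
    total = sum(msg)
    return [v / total for v in msg]

def belief(phi_i, incoming):
    """Eq. (3): local evidence times all incoming messages (unnormalized)."""
    b = list(phi_i)
    for m in incoming:
        b = [x * y for x, y in zip(b, m)]
    return b

def map_label(b):
    """Eq. (4): index of the label with the largest belief."""
    return max(range(len(b)), key=lambda a: b[a])

# Node 0 has strong evidence for label 2; node 1 weakly prefers label 0.
psi = potts_compatibility(3, penalty=50.0, sigma_v=50.0)
m01 = max_product_message(psi, [0.05, 0.05, 0.9], incoming=[])
print(map_label(belief([0.4, 0.3, 0.3], [m01])))  # prints 2
```

In this toy case the message from node 0 outweighs the weak local preference at node 1, which is the piecewise-constant bias that the Potts compatibility builds in.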


2.1.2 Truncated Linear Energy Model Derived $\Psi_{ij}$

Second, the compatibilities can be derived from the truncated
linear energy model, where the cost increases linearly
as a function of the difference between two labels, up
to a threshold, and then stays constant: $V(x_1, x_2) =
\min(\lambda_2 |x_1 - x_2|, T_2)$, with $\lambda_2$ and $T_2$ constant parameters.

For stereo, [28] has derived the compatibilities from such
a truncated linear potential function using the Total Variance
(TV) model:

$$\Psi_{ij}^{TL}(x_i, x_j) = (1 - e_p) \exp\left(-\frac{|x_i - x_j|}{\sigma_p}\right) + e_p \tag{6}$$

with $e_p$ a small constant and $\sigma_p$ a constant. $e_p$ and $\sigma_p$
together control the shape of the robust potential function.

Clearly both of these compatibility matrices prefer no
disparity change between two neighboring nodes, i.e. they
implicitly use the frontal parallel plane assumption.
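A short numerical sketch makes the frontal parallel bias of eq. (6) concrete. The function name is hypothetical; the default values of $e_p$ and $\sigma_p$ are the ones listed with the experiments, and the 0.5-pixel disparity step is an assumed example of a slanted surface.

```python
import math

def truncated_linear_compatibility(x_i, x_j, e_p=0.05, sigma_p=0.6):
    """Eq. (6): robust compatibility between two neighboring disparities."""
    return (1.0 - e_p) * math.exp(-abs(x_i - x_j) / sigma_p) + e_p

# Equal disparities are the most compatible configuration.
same = truncated_linear_compatibility(3.0, 3.0)

# On a slanted plane the true neighboring disparity differs, here by an
# assumed 0.5 pixel, and is scored as less compatible than the (wrong)
# constant-disparity hypothesis.
slanted = truncated_linear_compatibility(3.0, 3.5)

# For large differences the value flattens out near e_p (the robust floor).
far = truncated_linear_compatibility(0.0, 100.0)

print(same, slanted, far)
```

The correct neighbor on a slanted surface is penalized relative to a frontal parallel one, which is the source of the stepwise artifacts discussed in the experiments.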

3. Differential Geometric Constraints in Belief
Propagation

We now extend the compatibilities to include surface
differential geometric constraints. As pointed out by [4, 22],
using finite differences one has to consider cliques of three
neighboring nodes in the compatibility matrix for the plate
model, which accounts for an underlying slanted planar
surface model. But such an endeavor quickly makes the
problem computationally infeasible.

To use differential properties, at first consideration the
computational complexity issues would seem to multiply.
Not only are many disparity (depth) labels required, but
derivatives must be labels as well. In this section we show
how to attach differential properties to what we describe
as "floating" disparities (labels), such that the problem
formulation remains the same (i.e. pair-wise MRF), but great
improvements can be achieved by considering higher-order
differential geometric constraints for surface smoothness.

3.1. "Floating" Disparities

We now describe a different interpretation of the classical
formulation which allows us to make a small but necessary
change to the problem formulation. Fig. 2 shows that
node $i$ is connected to node $j$ and sends message $m_{ij}$ to
node $j$. Random variables $x_i$ (at node $i$) and $x_j$ (at node $j$)
take values from $L$ discretized disparities $\{d_1, d_2, \dots, d_L\}$.
The connection (edge) between two labels ($d_i$ and $d_j$)
contributes to the whole message $m_{ij}$ from node $i$ to node $j$.


Figure 2. 3D illustration of message passing. Nodes $i$ and $j$ have $L$
possible labels $\{d_1, \dots, d_L\}$. Message from node $i$ to node
$j$ is a vector encoding the "support" each label at $j$ receives from
all possible labels at $i$.

For detailed geometric modeling tasks this is usually not
enough. Directly increasing the number of discretized
disparities (using subpixel disparity levels) quickly makes it
computationally infeasible, especially for scenes with large
disparity range. As an alternative approach, we still use
$L$ disparities, which are initialized to be integer disparities
but can then be changed to continuous (floating) disparities
based on interpolation, similar in spirit to deformable
models [15]. In other words, the locations of the labels
can be adapted according to initial measurements so
that the initial lattice of disparity labels is adapted to the
scene structure. In particular, we compute a normalized
SSD score using a deformed window approach as in [7].
A direction set method [26] is used to find the floating point
$\{d, \frac{\partial d}{\partial u}, \frac{\partial d}{\partial v}\}$ that gives the best normalized SSD score.
Suppose $d \in [d_k, d_{k+1})$; then we change the disparity label
structure to let $d_k$ "float" to $d$. In practice for each node
$i$ we can perform such computations at several local minima
obtained from the initial integer SSD. The result is a
deformable disparity structure which encodes local measurements
based on a deformed window SSD. Note that the
differential properties (first order disparity derivatives) are also
encoded at each state $d_i$ at node $i$. Also note that time
complexity in computing messages (eq. (2)) remains the same.
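The label-floating step itself is simple once a refined disparity is available. The sketch below assumes the refined value $d$ has already been produced by some subpixel search (the deformed-window SSD and direction set method are not reproduced here); the function name and the list-based label structure are hypothetical.

```python
def float_label(labels, d):
    """Return a copy of the sorted label list in which the label d_k with
    d_k <= d < d_{k+1} has been moved ("floated") to the refined value d.
    The number of labels L is unchanged, so the cost of a message update
    stays the same. If d lies outside [d_1, d_L), nothing is changed."""
    floated = list(labels)
    for k in range(len(labels) - 1):
        if labels[k] <= d < labels[k + 1]:
            floated[k] = d
            break
    return floated

# Integer initialization with L = 4 labels; a refined estimate of 1.37
# replaces the label 1, leaving the other labels where they were.
print(float_label([0, 1, 2, 3], 1.37))  # prints [0, 1.37, 2, 3]
```

Repeating this at several local minima of the integer SSD, as the text describes, would float several labels per node while keeping the label count fixed.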

3.2. Differential Geometric Constraints in Compatibility
Matrix $\Psi_{ij}$

To illustrate how to encode surface geometry in the
computation of messages we walk through the specific
computations and point out where geometry comes in. $m_{ij}^{t+1}$ in eq. (2)
is the message from node $i$ to node $j$. It is a vector which
contains the "support" that individual labels in node $j$
receive from all labels in node $i$. For each state $d_j$ in node $j$ it is
computed as:

$$m_{ij}^{t+1}(x_j = d_j) \leftarrow \alpha \max_{x_i} \Psi_{ij}(x_i, x_j = d_j)\, \Phi_i(x_i) \prod_{k \in N(i) \setminus j} m_{ki}^{t}(x_i) \tag{7}$$

Fig. 2 illustrates $m_{ij}(x_j = d_k)$ and $m_{ij}(x_j = d_l)$.

3.2.1 Slanted Planar Surfaces

Figure 3. 1D illustration: A general planar surface model at $d_i$
imposes constraints that $d_j$ lies on the same plane (in solid lines)
and has the same surface normal. Also shown is the frontal parallel
plane at $d_i$ (in dotted lines).

For node $i$ with label $d_i$, a general planar surface model
at $d_i$ is $d(u, v) = d_i + \frac{\partial d}{\partial u}(u - u_i) + \frac{\partial d}{\partial v}(v - v_i)$, representing
the fact that contextual information prefers a neighboring
label $d_j$ to lie on the same planar surface (possibly
non-frontal parallel) and should have the same surface normal.
Fig. 3 shows this point. The compatibility using such a
coplanar model is:

$$\Psi_{ij}(x_i = d_i, x_j = d_j) = \left((1 - e_p) \exp\left(-\frac{\left|d_j - d_i - \frac{\partial d}{\partial u}(u_j - u_i) - \frac{\partial d}{\partial v}(v_j - v_i)\right|}{\sigma_p}\right) + e_p\right) \left((1 - e_N) \exp\left(-\frac{\|N_{d_j} - N_{d_i}\|}{\sigma_N}\right) + e_N\right) \tag{8}$$

where $N$ is the surface normal in the disparity space and
can be computed as $N = \left(-\frac{\partial d}{\partial u}, -\frac{\partial d}{\partial v}, 1\right) / \sqrt{\left(\frac{\partial d}{\partial u}\right)^2 + \left(\frac{\partial d}{\partial v}\right)^2 + 1}$.
If the cameras are further assumed to be calibrated, surface
normals in Euclidean space can also be used. They are derived
from the depth $z(u, v)$, which is inversely proportional to the
disparity $d(u, v)$ through the stereo baseline $B$, the focal
length $\alpha$ in pixels, and the focal length $f$ in physical units. Note
that this compatibility measure contains the frontal parallel
plane model as a special case. When the underlying
model at $d_i$ is a frontal parallel plane, $\frac{\partial d}{\partial u} = \frac{\partial d}{\partial v} = 0$ and
the above compatibility simplifies to eq. (6).

3.2.2 Curved Surfaces

Figure 4. 1D illustration: A smooth curved surface model at $d_i$
imposes constraints that $d_j$ lies on the same surface and has surface
normal $N_{d_i} + \nabla_v N_{d_i}$. Also shown are the frontal parallel plane
and the tangent plane at $d_i$ (in dotted lines).

For curved surfaces not only the depth (positional
disparity) changes in the neighborhood, but the surface normal
also changes. To explicitly take this into account in the
compatibilities we have to consider the covariant derivative
of the surface normal, $\nabla_v N_{d_i}$, which encodes the change
of surface normal for a (displacement) tangent vector $v$ at
$d_i$. Fig. 4 illustrates this point. $\nabla_v N_{d_i}$ can be computed
using the shape operator (or second fundamental form)
[8, 16, 25, 6], for a tangent vector $v$ in the tangent plane
$T_{d_i}(M)$. The computation will involve second order
disparity derivatives, which can be either estimated initially
together with $\{d, \frac{\partial d}{\partial u}, \frac{\partial d}{\partial v}\}$, as attempted in [7], or more
preferably by a local fitting procedure over the initial
estimates in Euclidean space. Space limitation prevents us
from a detailed discussion, but our analysis indicates that
the latter approach is the numerically stable one for dealing
with higher order differential properties. We use this one in
our computations. For detailed discussions on the geometric
computation, see [21].

The compatibility using a quadratic approximation as the
local surface model of a curved surface is:

$$\Psi_{ij}(x_i = d_i, x_j = d_j) = \left((1 - e_p) \exp\left(-\frac{\left|d_j - d_i - \frac{\partial d}{\partial u}(u_j - u_i) - \frac{\partial d}{\partial v}(v_j - v_i)\right|}{\sigma_p}\right) + e_p\right) \left((1 - e_N) \exp\left(-\frac{\|N_{d_j} - N_{d_i} - \nabla_v N_{d_i}\|}{\sigma_N}\right) + e_N\right) \tag{9}$$

Note that when the underlying surface is a planar surface,
$\nabla_v N_{d_i} = 0$ and the above compatibility simplifies to eq. (8).

4. Experimental Results

In this section we describe representative experimental
results on various scenes, with some quantitative error
analysis. In particular we compare results using our formulation
of compatibilities with results using compatibilities that have
been used in the literature [29, 28], which demonstrates that our
new formulation indeed improves the performance of the
belief propagation algorithm on generic non-frontal parallel
scenes.

Local evidence $\Phi_i(x_i)$ is computed using a similar formula
as in [29]:

$$\Phi_i(x_i) = \exp\left(-\frac{D_i(x_i)}{\sigma_D}\right)$$

with the truncated data cost $D_i(x_i) =
\min(|I_L(u, v) - I_R(u - x_i, v)|, C)$ at node $i$ (i.e. pixel $(u, v)$)
with disparity $x_i$, as in [29, 10]. For results with traditional
compatibilities (eq. (5) and eq. (6)), the Birchfield-Tomasi
technique [2] is used in computing the image intensity
difference in the above data cost, the same as in [29, 28].
We use $C = 60$ and $\sigma_D = 50$ in the experiments.
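The local evidence can be sketched for a single scanline as follows. The function name and the list-based image rows are assumptions, the plain absolute intensity difference stands in for the Birchfield-Tomasi measure, and an integer disparity with `u - disparity` inside the row is assumed.

```python
import math

def local_evidence(left_row, right_row, u, disparity, C=60.0, sigma_d=50.0):
    """Phi_i(x_i) = exp(-D_i(x_i) / sigma_D) with the truncated data cost
    D_i(x_i) = min(|I_L(u) - I_R(u - x_i)|, C) on one rectified scanline."""
    diff = abs(left_row[u] - right_row[u - disparity])
    cost = min(diff, C)  # truncation bounds the influence of outliers
    return math.exp(-cost / sigma_d)

# Toy rows in which the right image is the left one shifted by one pixel.
left = [10, 20, 30, 200, 50]
right = [20, 30, 200, 50, 0]
print(local_evidence(left, right, u=3, disparity=1))  # perfect match
print(local_evidence(left, right, u=3, disparity=0))  # truncated mismatch
```

Because the cost is truncated at $C$, the evidence never drops below $\exp(-C / \sigma_D)$, so a single bad intensity match cannot veto a label outright.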

The first example (Fig. 5) is the synthetic "Corridor"
pair [12] with ground truth (courtesy of University of
Bonn). The image size is 256x256 pixels with a disparity
range of 11 pixels. The underlying scene structure consists
of slanted planar surfaces. Fig. 5(c) shows the ground truth
disparity map. Fig. 5(d-i) shows the results using different
compatibilities. In particular, Fig. 5(d) is the result using
Potts energy model derived compatibilities (eq. (5)), with
sum-product message updating and the MMSE solution;
Fig. 5(e) is with Truncated Linear energy model derived
compatibilities (eq. (6)), with sum-product message updating
and the MMSE solution; Fig. 5(f) is with Potts energy
model derived compatibilities (eq. (5)), with max-product
message updating and the MAP solution (the same as in
[29]), as well as a subpixel interpolated version based on the
SSD cost of an 11x11 window in Fig. 5(g); Fig. 5(h) shows the
result using Truncated Linear energy derived compatibilities
(eq. (6)), with max-product message updating and the MAP
solution (the same as in [28]); and finally in Fig. 5(i) we
show our result using the general planar surface model
derived compatibilities (eq. (8)) proposed in this paper, with
max-product message updating and the MAP solution. We
choose a fixed set of parameters throughout the experimental
section. In particular, $\sigma_V = 50$, $C_1 = 2$, $\lambda_1 = 50$,
$T = 4$, and $e_p = 0.05$, $\sigma_p = 0.6$, $e_N = 0.05$, $\sigma_N = 0.4$.

Figure 5. (a)(b) Left (reference) and right images. (c) Ground truth disparity map. (d) Potts model derived compatibilities (eq. (5)),
Sum-product, MMSE estimate. (e) Truncated linear energy model derived compatibilities (eq. (6)), Sum-product, MMSE estimate. (f)
Potts model derived compatibilities (eq. (5)), Max-product, MAP estimate. (g) Potts model derived compatibilities (eq. (5)), Max-product,
MAP estimate+subpixel refinement. (h) Truncated linear energy model derived compatibilities (eq. (6)), Max-product, MAP estimate.
(i) Our general planar surface model derived compatibilities (eq. (8)) proposed in this paper, Max-product, MAP estimate. Using other
compatibilities one obtains stepwise scalloped patterns because of the frontal parallel plane assumption. On the other hand our result has
gradual smooth disparity change, indicating the correct reconstruction. Error statistics in Fig. 6.

In Fig. 6 we show the error statistics using the taxonomy
package [27]. In particular we compute the percentage of
bad matching pixels with absolute disparity error larger than
different thresholds ranging from 0.25-1.50 pixels. In our
result, we achieve better error statistics because we
explicitly model 3D surface geometry. The computational
complexity of message updating with the new compatibilities
(eq. (8)) is the same as with the standard compatibilities
(eq. (6)). But in practice the standard compatibilities are
pre-computed and stored in a lookup table, while the new
compatibilities are explicitly computed (although they can
similarly be stored in a lookup table). Thus our algorithm's
running time increases over the standard max-product
belief propagation algorithm with compatibilities as in eq. (6).
For this image pair we observe an increase from 10 seconds
to 16 seconds for each iteration of message updating
on a 2.4GHz CPU. The proposed compatibilities will also
require computational overheads to obtain the initial
differential properties; we use an 11x11 deformed window (as in
[7, 19]) and have observed a running time of about 8
milliseconds per pixel (node).

Figure 6. Error statistics for Corridor pair: shown is the percentage
of bad matching pixels using the taxonomy package [27] at 5
different thresholds ranging from 0.25-1.50 pixels. Our result has
better error statistics for such slanted planar surfaces because our
compatibilities explicitly encode such surface geometry.

Figure 7. (a)(b) Left (reference) and right images. (c) Disparity map using Truncated Linear energy model derived compatibilities (eq. (6)),
sum-product message updating and the MMSE solution. (d) Potts model derived compatibilities (eq. (5)), Max-product, MAP estimate.
(e) Truncated Linear energy model compatibilities (eq. (6)), Max-product, MAP estimate. (f) Our general planar surface model derived
compatibilities (eq. (8)) proposed in this paper, Max-product, MAP estimate. Once again using other compatibilities one obtains stepwise
scalloped patterns because of the frontal parallel plane assumption being used. On the other hand our result has gradual smooth disparity
change, indicating the correct reconstruction.

The second example is the "Parking meter" pair from the
well-known JISCT database. The image size is 256x240
pixels with a disparity range of 10 pixels. Fig. 7 shows the
results using different compatibilities. Specifically, Fig. 7(c)
shows the result using the Truncated Linear energy model
derived compatibilities (eq. (6)), with sum-product message
updating and the MMSE solution; Fig. 7(d) is the result
using the Potts energy model derived compatibilities (eq. (5)),
with max-product message updating and the MAP solution
(the same as in [29]); and Fig. 7(e) is from Truncated Linear
energy model derived compatibilities (eq. (6)), with
max-product message updating and the MAP solution (the same
as in [28]); finally Fig. 7(f) is our result using the general
planar surface model derived compatibilities (eq. (8))
proposed in this paper, with max-product message passing and
the MAP solution. Running time with the proposed
compatibilities is 11 seconds per iteration of message updating.

Figure 8. (a)(b) Left (reference) and right images. (c) Potts model derived compatibilities (eq. (5)), Max-product, MAP estimate. (d)
Truncated linear energy model derived compatibilities (eq. (6)), Max-product, MAP estimate. (e) Truncated linear energy model derived
compatibilities (eq. (6)), Max-product, MAP estimate+subpixel refinement. (f) Our smooth curved surface model derived compatibilities
(eq. (9)) proposed in this paper, Max-product, MAP estimate. As in the previous examples, using other compatibilities one obtains stepwise
scalloped patterns because of the frontal parallel plane assumption. On the other hand our result has gradual smooth disparity change,
indicating the correct reconstruction.

The third example is a stereo pair of a human face
(Fig. 8). Ground truth data (3D geometry and texture
map) were obtained from the Cyberware laser scanner
dataset. The true disparity map is then computed. The
stereo pair has a baseline of 6cm and focal length 1143 pixels.
The human head ranges from 26.5cm to 53.5cm in front
of the camera. The original image size is 1024x768 pixels
but then subsampled to 256x192 pixels and further cropped
to 160x192 pixels with a disparity range of 31 pixels.
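As a rough plausibility check on these numbers, the standard rectified pinhole relation between depth and disparity can be evaluated for the stated baseline, focal length, and depth range. The relation $d = B\alpha / z$ and the assumption that subsampling by a factor of 4 divides disparities by 4 are textbook conventions, not statements taken from the paper; the function name is hypothetical.

```python
def disparity_pixels(baseline_cm, focal_px, depth_cm):
    """Disparity in pixels for a rectified pinhole stereo pair,
    assuming d = B * alpha / z (baseline times focal length over depth)."""
    return baseline_cm * focal_px / depth_cm

near = disparity_pixels(6.0, 1143.0, 26.5)   # closest point of the head
far = disparity_pixels(6.0, 1143.0, 53.5)    # farthest point of the head
full_res_range = near - far                  # at 1024x768 resolution
subsampled_range = full_res_range / 4.0      # 1024 -> 256 columns

print(round(near, 1), round(far, 1), round(subsampled_range, 1))
```

Under these assumptions the subsampled range comes out between 32 and 33 pixels, close to the 31-pixel range stated above; the sketch does not try to account for the remaining difference.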

Fig. 8 shows the results using different compatibilities. In
particular, Fig. 8(c) is from the Potts energy model derived
compatibilities (eq. (5)), with max-product message updating
and the MAP solution (the same as in [29]); Fig. 8(d) is
from Truncated Linear energy model derived compatibilities
(eq. (6)), with max-product message updating and the MAP
solution (the same as in [28]), as well as a subpixel
interpolated version based on the SSD cost of an 11x11 window in
Fig. 8(e); and finally Fig. 8(f) is our result using the curved
surface model derived compatibilities (eq. (9)) proposed in
this paper, also with max-product message updating and the
MAP solution. Running time of our algorithm is about 40
seconds per iteration of message updating. For this pair the
occlusion is detected using a combination of thresholds on
the deformed SSD score, belief, and local evidence. As
in the first example, Fig. 9 shows the error statistics. We
achieve better error statistics because we explicitly model
such 3D curved surface geometry.

Figure 9. Error statistics for Face pair: shown is the percentage
of bad matching pixels using the taxonomy package [27] at 5
different thresholds ranging from 0.25-1.50 pixels. Our result has
better error statistics for such smooth curved surfaces because our
compatibilities explicitly encode such surface geometry.
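The bad-matching-pixel statistic reported in Figs. 6 and 9 can be sketched as below. This is not the taxonomy package [27] itself: the function name, the flat-list inputs, and the specific threshold values are assumptions (the text gives only the range 0.25-1.50 and the count of five), and occlusion handling is omitted.

```python
def bad_pixel_percentage(estimated, ground_truth, threshold):
    """Percentage of pixels whose absolute disparity error exceeds
    `threshold`. Inputs are flat, equal-length lists of disparities."""
    bad = sum(1 for e, g in zip(estimated, ground_truth)
              if abs(e - g) > threshold)
    return 100.0 * bad / len(ground_truth)

# A slanted surface (ramp) against a staircase estimate of the kind
# produced under the frontal parallel plane assumption.
ramp = [0.0, 0.5, 1.0, 1.5]
staircase = [0.0, 0.0, 1.0, 1.0]

# Hypothetical thresholds spanning the stated range.
for t in (0.25, 0.5, 0.75, 1.0, 1.5):
    print(t, bad_pixel_percentage(staircase, ramp, t))
```

Half-pixel staircase errors are counted as bad only at the tightest threshold, which is why evaluating at subpixel thresholds matters when comparing reconstructions of slanted or curved surfaces.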

Note that the problem as formulated in Fig. 2 can also
be solved using relaxation labeling [14], as employed for
feature-based stereo in [20] and for texture flow analysis in
[1].

5.  Conclusion 

In this paper we introduce surface differential geometric
constraints to the belief propagation algorithm. In particular,
we derive the compatibilities using both position disparity
(or depth) and surface normal for slanted and curved
surfaces, and illustrate their use. Such compatibilities
extend traditional belief propagation to handle generic surface
types. The result is an improved belief propagation algorithm
that can perform well on slanted or curved surfaces.
Experimental results demonstrate the importance of
incorporating surface differential geometry with a powerful
inference algorithm.

Acknowledgments

We thank Richard Szeliski, Daniel Scharstein, and Marshall
Tappen for helpful explanations of their software packages
[27, 29], as well as the anonymous reviewers for detailed
comments. Research supported by AFRL, AFOSR,
and DARPA.

References

[1] O. Ben-Shahar and S. W. Zucker. The perceptual organization
of texture flow: A contextual inference approach. IEEE
Trans. on PAMI, 25(4):401-417, 2003.

[2] S. Birchfield and C. Tomasi. A pixel dissimilarity measure
that is insensitive to image sampling. IEEE Trans. on PAMI,
20(4):401-406, 1998.

[3] S. Birchfield and C. Tomasi. Multiway cut for stereo and
motion with slanted surfaces. In Proc. ICCV, 1999.

[4] A. Blake and A. Zisserman. Visual Reconstruction. The MIT
Press, 1987.

[5] Y. Boykov, O. Veksler, and R. Zabih. Fast approximate
energy minimization via graph cuts. IEEE Trans. on PAMI,
23(11):1222-1239, 2001.

[6] R. Cipolla and P. Giblin. Visual Motion of Curves and
Surfaces. Cambridge Univ. Press, 2000.

[7] F. Devernay and O. D. Faugeras. Computing differential
properties of 3-d shapes from stereoscopic images without
3-d models. In Proc. CVPR, 1994.

[8] M. P. do Carmo. Differential Geometry of Curves and
Surfaces. Prentice-Hall, Inc., 1976.

[9] O. Faugeras. Three-Dimensional Computer Vision. The MIT
Press, 1993.

[10] P. F. Felzenszwalb and D. P. Huttenlocher. Efficient belief
propagation for early vision. In Proc. CVPR, 2004.

[11] W. T. Freeman, E. C. Pasztor, and O. T. Carmichael.
Learning low-level vision. IJCV, 40(1):25-47, 2000.

[12] T. Frohlinghaus and J. M. Buhmann. Regularizing
phase-based stereo. In Proc. of ICPR, 1996.

[13] R. Hartley and A. Zisserman. Multiple View Geometry in
Computer Vision. Cambridge Univ. Press, 2000.

[14] R. A. Hummel and S. W. Zucker. On the foundations of
relaxation labeling processes. IEEE Trans. on PAMI, 5(3):267-287,
1983.

[15] M. Kass, A. Witkin, and D. Terzopoulos. Snakes: Active
contour models. IJCV, pages 321-331, 1988.

[16] J. J. Koenderink. Solid Shape. The MIT Press, 1990.

[17] V. Kolmogorov and R. Zabih. Computing visual correspondence
with occlusions using graph cuts. In Proc. ICCV,
2001.

[18] V. Kolmogorov and R. Zabih. Multi-camera scene
reconstruction via graph cuts. In Proc. ECCV, 2002.

[19] G. Li and S. W. Zucker. Stereo for slanted surfaces: First
order disparities and normal consistency. In Proc. EMMCVPR,
LNCS 3757, 2005.

[20] G. Li and S. W. Zucker. Contextual inference in
contour-based stereo correspondence. IJCV, in press, 2006.

[21] G. Li and S. W. Zucker. Differential geometric consistency
extends stereo to curved surfaces. In Proc. ECCV, 2006.

[22] S. Z. Li. Markov Random Field Modeling in Image Analysis.
Springer-Verlag, 2001.

[23] M. H. Lin and C. Tomasi. Surfaces with occlusions from
layered stereo. IEEE Trans. on PAMI, 26(8):1073-1078, 2004.

[24] A. S. Ogale and Y. Aloimonos. Stereo correspondence with
slanted surfaces: Critical implications of horizontal slant. In
Proc. CVPR, 2004.

[25] B. O'Neill. Elementary Differential Geometry. Academic
Press, second edition, 1997.

[26] W. H. Press, S. A. Teukolsky, W. T. Vetterling, and B. P.
Flannery. Numerical Recipes in C. Cambridge University
Press, second edition, 1992.

[27] D. Scharstein and R. Szeliski. A taxonomy and evaluation of
dense two-frame stereo correspondence algorithms. IJCV,
47(1/2/3):7-42, 2002.

[28] J. Sun, N.-N. Zheng, and H.-Y. Shum. Stereo matching using
belief propagation. IEEE Trans. on PAMI, 25(7):787-800,
2003.

[29] M. F. Tappen and W. T. Freeman. Comparison of graph cuts
with belief propagation for stereo, using identical MRF
parameters. In Proc. ICCV, 2003.


Proceedings of the 2006 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR'06)
0-7695-2597-0/06 $20.00 © 2006 IEEE




"The  views  and  conclusions  contained  herein  are  those  of  the  authors  and  should  not  be  interpreted  as  necessarily  representing  the  official  policies  or  endorsements,  either  expressed  or 
implied,  of  AFRL/SNAT  (now  AFRL/RYAT)  or  the  U.S.  Government." 

"This  material  is  based  on  research  sponsored  by  AFRL/SNAT  under  agreement  number  FA8650-05-1-1800  (BAA  04-03-SNK  Amendment  3).  The  U.S.  Government  is  authorized  to 
reproduce  and  distribute  reprints  for  Governmental  purposes  notwithstanding  any  copyright  notation  thereon." 


Final Presentation - Part 1 (a)
Diffusion Maps and Geometric Harmonics for ATR

FA8650-05-1-1800
23 Nov 2004 - 31 Oct 2007


Steven W. Zucker & Ronald R. Coifman

Department of Computer Science and Program in Applied Mathematics

Yale University

(a) The presentation is ©2007 Steven W. Zucker




Supported  by  :  AFRL 


Thanks  to  : 


Andreas  Glaser. 
Patrick  Huggins. 
Yosi  Keller. 

Gang  Li. 

Edo  Liberty. 
Mauro  Maggioni 




Goals: 

•  ATR  Pattern  Classification  from  Multiple  Data  Sources 

•  Representation  of  Information 

•  Dimensionality-reduction  Framework 

— Based on Diffusion Maps (“non-linear PCA”)

—  Purely  Data  Driven 
—  Reveal  Intrinsic  Data  Geometry 

•  Fusion  of  Data  Sources 

• Simple → Complex Features

•  “Symbolic”  Features 

Accomplishments:

• Dimensionality-reduction Framework for Data Fusion

• Fusion of boundary/texture/color data for improved
segmentation

•  Fusion  of  Auditory  and  Visual  Data  for  improved  classification 

•  (Fusion  of  left/right  stereo  pairs):  advanced  geometry 

•  (Embeddings  of  Symbolic  Data):  MMPI 

Overview of Talk:

• Dimensionality-reduction Framework

• Results: Fusion of Auditory and Visual Data for improved
classification

• Example: Fusion of boundary/texture/color data for improved
segmentation

•  Overview  of  stereo  system. 

•  Overview  of  color  projections. 

•  Overview  of  MMPI-2  results. 

[Figure: three input data sources; for each one, form a graph and compute diffusion maps.]

Gaussian kernel in dimensionality reduction (Short history)

The assumption that high dimensional data reside on or near a low
dimensional manifold inspired many theoretical and experimental
results.

• Schölkopf and Smola use the Gaussian kernel with no
normalization for non-linear PCA.

• Belkin and Niyogi normalize the Gaussian kernel to be the
Laplacian of a graph defined on the data.

• Coifman and Lafon further normalize for non-uniform sampling
from the manifold.

Given a set of n input vectors $x_i \in \mathbb{R}^d$:

1. $K_0(i,j) \leftarrow e^{-\|x_i - x_j\|^2 / \sigma^2}$

2. $p(i) \leftarrow \sum_{j=1}^{n} K_0(i,j)$ approximates the density at $x_i$

3. $\tilde{K}(i,j) \leftarrow \dfrac{K_0(i,j)}{p(i)\,p(j)}$

4. $d(i) \leftarrow \sum_{j=1}^{n} \tilde{K}(i,j)$

5. $K(i,j) \leftarrow \dfrac{\tilde{K}(i,j)}{\sqrt{d(i)\,d(j)}}$

6. $U S U^T = K$ (by SVD of $K$)

Stages 2 and 3 normalize for density; stages 4 and 5 perform the graph Laplacian normalization. In
the limit $n \to \infty$ and $\sigma \to 0$:

• $K$ converges to a conjugate to the diffusion operator $\Delta$.

• The functions $\psi_k(x) = u_k(x) / u_0(x)$ converge to the eigenfunctions of $\Delta$ on $\mathcal{M}$.
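
The six stages above can be sketched in a few lines of NumPy. This is a minimal illustration and not the authors' implementation: the function name `diffusion_map`, the parameter `n_coords`, and the choice of $\sigma^2$ (rather than, say, $2\sigma^2$) as the kernel scale are assumptions, and a symmetric eigendecomposition stands in for the SVD because the normalized kernel is symmetric.

```python
import numpy as np

def diffusion_map(X, sigma, n_coords=2):
    """Sketch of the six-stage normalization on an (n, d) data array X."""
    # Stage 1: Gaussian kernel on pairwise squared distances (scale sigma**2 is assumed).
    sq = np.sum(X**2, axis=1)
    d2 = np.maximum(sq[:, None] + sq[None, :] - 2.0 * X @ X.T, 0.0)
    K0 = np.exp(-d2 / sigma**2)
    # Stages 2-3: estimate the sampling density and divide it out.
    p = K0.sum(axis=1)
    Kt = K0 / np.outer(p, p)
    # Stages 4-5: symmetric graph Laplacian normalization.
    d = Kt.sum(axis=1)
    K = Kt / np.sqrt(np.outer(d, d))
    # Stage 6: K is symmetric, so eigh plays the role of the SVD here.
    evals, U = np.linalg.eigh(K)
    order = np.argsort(evals)[::-1]
    evals, U = evals[order], U[:, order]
    # psi_k = u_k / u_0; psi_0 is constant, so it is dropped from the coordinates.
    psi = U[:, 1:n_coords + 1] / U[:, [0]]
    return evals[:n_coords + 1], psi
```

Calling `diffusion_map(X, sigma=0.5)` on an `(n, d)` array returns the leading eigenvalues and an `(n, 2)` array of embedding coordinates. The dense `n × n` kernel limits this sketch to modest `n`.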

The diffusion distance $D(x, y)$ is a
distance between points
that measures the connectivity
of x and y in the data.

More robust than geodesic distance.


The diffusion maps $\Phi$ give a new representation
of the data as points in a Euclidean space.
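
For reference, in the standard diffusion-maps formulation of Coifman and Lafon, which this slide appears to follow, the distance and the map are linked as below. The diffusion time $t$, the eigenvalues $\lambda_k$, and the subscripted notation $D_t$, $\Phi_t$ are assumptions taken from that formulation and cannot be recovered from the slide itself.

```latex
D_t(x, y)^2 \;=\; \sum_{k \ge 1} \lambda_k^{2t}\,\bigl(\psi_k(x) - \psi_k(y)\bigr)^2
\;=\; \bigl\| \Phi_t(x) - \Phi_t(y) \bigr\|^2,
\qquad
\Phi_t(x) \;=\; \bigl(\lambda_1^{t}\,\psi_1(x),\; \lambda_2^{t}\,\psi_2(x),\; \ldots\bigr)
```

In this form the diffusion distance is simply the Euclidean distance between embedded points, which is why the embedded coordinates can be used directly by distance-based classifiers.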

Diffusion Maps Reveal “Manifold”

Reading Lips

Diffusion Maps Reveal “Manifold”


Current  Opinion  In  Neurobiology 


©Steven  W.  Zucker 




Digit  Trajectories  over  “Manifold” 



Visual Data Classifier


•  View  digit  words  as  a  trajectory  in  the  diffusion  space 

•  Word  recognition  is  identifying  trajectories 


•  Classifier:  compare  new  trajectory  to  collection  of  labeled 
(training)  trajectories. 


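• Use the symmetric Hausdorff distance between two sets $\Gamma_1$ and $\Gamma_2$,
defined as

$$d(\Gamma_1, \Gamma_2) = \max\left\{ \max_{x_2 \in \Gamma_2} \min_{x_1 \in \Gamma_1} \|x_1 - x_2\|,\; \max_{x_1 \in \Gamma_1} \min_{x_2 \in \Gamma_2} \|x_1 - x_2\| \right\} \qquad (1)$$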
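
The classifier described above can be sketched as a nearest-trajectory rule under the symmetric Hausdorff distance. This is an illustration under stated assumptions: each trajectory is an array of points in the diffusion space, and the names `hausdorff` and `classify` and the `(label, trajectory)` training format are hypothetical.

```python
import numpy as np

def hausdorff(A, B):
    """Symmetric Hausdorff distance between point sets A (n, m) and B (k, m)."""
    # Pairwise Euclidean distances between every point of A and every point of B.
    D = np.linalg.norm(A[:, None, :] - B[None, :, :], axis=2)
    # Each directed distance is the largest nearest-neighbor distance from one set to the other.
    return max(D.min(axis=1).max(), D.min(axis=0).max())

def classify(trajectory, labeled):
    """Return the label of the training trajectory closest in Hausdorff distance.

    `labeled` is a list of (label, trajectory) pairs (hypothetical format).
    """
    return min(labeled, key=lambda item: hausdorff(trajectory, item[1]))[0]
```

Because the Hausdorff distance treats a trajectory as an unordered point set, this rule ignores the temporal order of the frames; whether the original classifier also used ordering is not stated in the slide.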

Visual Data Classification

        “0”   “1”   “2”   “3”   “4”   “5”   “6”   “7”   “8”   “9”
zero    0.90  0     0     0.01  0     0     0.08  0     0     0
one     0     0.99  0     0     0     0     0     0     0.01  0
two     0.04  0.01  0.90  0.03  0.02  0     0     0     0     0
three   0     0     0.01  0.94  0     0     0.01  0.02  0.01  0
four    0.01  0     0     0.05  0.93  0     0     0     0     0
five    0     0     0     0     0     0.81  0.01  0.16  0     0.01
six     0.07  0     0     0.01  0     0     0.87  0.03  0.01  0.01
seven   0.03  0     0     0.04  0     0.07  0.05  0.74  0.04  0.02
eight   0     0     0     0     0.02  0.03  0     0.03  0.75  0.16
nine    0     0     0     0     0     0     0     0.04  0.14  0.82

Note Similarities between Six and Seven

Audio Data Classification (n = 10)

        “0”   “1”   “2”   “3”   “4”   “5”   “6”   “7”   “8”   “9”
zero    0.75  0     0.04  0     0.01  0.01  0.06  0.08  0.05  0
one     0     0.94  0     0     0     0.03  0     0     0     0.02
two     0.02  0     0.87  0.04  0.01  0     0.01  0     0.03  0.02
three   0.01  0     0.03  0.90  0.02  0.01  0     0     0.01  0.01
four    0.01  0     0     0.02  0.96  0     0     0     0     0.01
five    0.01  0.01  0     0.06  0     0.86  0     0.01  0.01  0.03
six     0     0     0     0     0.01  0     0.93  0.05  0     0
seven   0.05  0     0     0     0     0     0.14  0.81  0.01  0
eight   0.02  0     0.04  0.02  0     0.02  0     0.07  0.80  0.03
nine    0     0.01  0     0.01  0.01  0.04  0     0     0.01  0.92

Multisensor Embedding For Sensor Fusion

• Starting with $K$ input sources $\{\Omega_k\}$, $k = 1 \ldots K$.

• Compute the Laplace-Beltrami embeddings of $\{\Omega_k\}$, denoted
$\{\phi^k\}$, where $m_k$ is the dimensionality of the embedding of
the $k$'th channel.

• Compute the unified coordinates set $\Omega = \{z_1, \ldots, z_n\}$ by
appending the embeddings of each input sensor

$z_i = \{\phi_i^1, \ldots, \phi_i^K\}, \quad i = 1 \ldots n, \quad k = 1 \ldots K$
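
The appending step can be sketched as a column-wise concatenation of the per-channel embeddings. This is a minimal illustration, assuming every channel embeds the same $n$ samples in the same row order; the function name `fuse_embeddings` is hypothetical.

```python
import numpy as np

def fuse_embeddings(embeddings):
    """Append per-channel embeddings, each of shape (n, m_k), into unified coordinates."""
    n = embeddings[0].shape[0]
    # Row i of every channel must describe the same sample i.
    if any(e.shape[0] != n for e in embeddings):
        raise ValueError("all channels must embed the same n samples")
    # Result has shape (n, m_1 + ... + m_K); row i is the unified coordinate z_i.
    return np.hstack(embeddings)
```

With a 5-coordinate audio embedding and a 5-coordinate visual embedding, each sample receives a 10-dimensional unified coordinate. No per-channel rescaling is applied here; whether the original system rescaled the channels before appending is not stated.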

Audio and Visual Data Classification (n = 5 + 5)

        “0”   “1”   “2”   “3”   “4”   “5”   “6”   “7”   “8”   “9”
zero    0.90  0.00  0.00  0.00  0.00  0.00  0.06  0.04  0.00  0.00
one     0.00  0.99  0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.01
two     0.00  0.00  0.96  0.01  0.02  0.00  0.00  0.00  0.00  0.00
three   0.00  0.00  0.00  0.99  0.00  0.00  0.00  0.00  0.00  0.01
four    0.00  0.00  0.00  0.04  0.96  0.00  0.00  0.00  0.00  0.00
five    0.00  0.00  0.00  0.00  0.00  0.97  0.00  0.00  0.02  0.01
six     0.06  0.00  0.00  0.00  0.00  0.00  0.90  0.04  0.00  0.00
seven   0.03  0.00  0.00  0.00  0.00  0.00  0.03  0.93  0.00  0.00
eight   0.00  0.00  0.00  0.00  0.00  0.01  0.00  0.01  0.95  0.03
nine    0.00  0.01  0.00  0.00  0.00  0.01  0.00  0.00  0.02  0.96

Summary Classification

Channel type  “0”   “1”   “2”   “3”   “4”   “5”   “6”   “7”   “8”   “9”
Audio         0.75  0.94  0.87  0.90  0.96  0.86  0.93  0.81  0.80  0.92
Visual        0.90  0.99  0.90  0.94  0.93  0.81  0.87  0.74  0.75  0.82
Combined      0.90  0.99  0.96  0.99  0.96  0.97  0.90  0.93  0.95  0.96
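
The summary rows can be compared with a short calculation. The sketch below copies the per-digit rates from the table and derives two quantities that are not in the original slides: an unweighted mean per channel (which assumes all digits are equally frequent) and the digits for which the combined channel falls below the better single channel.

```python
# Per-digit correct-classification rates for digits 0..9, copied from the summary table.
accuracy = {
    "Audio":    [0.75, 0.94, 0.87, 0.90, 0.96, 0.86, 0.93, 0.81, 0.80, 0.92],
    "Visual":   [0.90, 0.99, 0.90, 0.94, 0.93, 0.81, 0.87, 0.74, 0.75, 0.82],
    "Combined": [0.90, 0.99, 0.96, 0.99, 0.96, 0.97, 0.90, 0.93, 0.95, 0.96],
}

# Unweighted mean over digits (derived here; assumes equal class frequencies).
mean_accuracy = {ch: sum(v) / len(v) for ch, v in accuracy.items()}

# Digits where fusion scores lower than the best single channel.
worse = [
    d for d in range(10)
    if accuracy["Combined"][d] < max(accuracy["Audio"][d], accuracy["Visual"][d])
]
```

Under these assumptions the means come to roughly 0.874 (audio), 0.865 (visual), and 0.951 (combined), and the only digit where the combined channel trails a single channel is six (0.90 combined against 0.93 audio).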


Final  Presentation  -  Part  2a 
Diffusion  Maps  and  Geometric  Harmonics  for  ATR 

FA8650-05-1-1800 
23  Nov  2004  -  31  Oct  2007 

Steven W. Zucker & Ronald R. Coifman

Department  of  Computer  Science  and  Program  in  Applied  Mathematics 

Yale  University 


aThe  presentation  is  ©2007  Steven  W.  Zucker 




Psychological  Questionnaires 

Answer  by  Yes  or  No 

Group  A 

•  I  find  it  hard  to  wake  up  in  the  morning. 

•  I’m  usually  burdened  by  my  tasks  for  the  day. 

•  I  love  dancing. 

What  about  Group  B? 

•  I  like  poetry. 

•  I  might  enjoy  being  a  dog  trainer. 

•  I  read  the  newspaper  every  day. 

Group A are questions like the ones in the MMPI-2 test,
aimed at estimating depression.

• I find it hard to wake up in the morning. (yes)

• I’m usually burdened by my tasks for the day. (yes)

• I love dancing. (no)

In the MMPI-2 a (raw) score is the sum of “correct answers”.

Group B, designed to test for other conditions,
seems unrelated to depression.

•  I  like  poetry.  (?) 

•  I  might  enjoy  being  a  dog  trainer.  (?) 

•  I  read  the  newspaper  every  day.  (?) 
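
The raw-score rule (a sum of “correct answers”) can be sketched as follows. The item identifiers and the dictionary layout are hypothetical and chosen only to mirror the three Group A examples; they are not the actual MMPI-2 item key.

```python
# Hypothetical key for the three Group A examples: item -> answer that counts toward the scale.
DEPRESSION_KEY = {
    "hard_to_wake_up": True,      # "yes" counts
    "burdened_by_tasks": True,    # "yes" counts
    "love_dancing": False,        # "no" counts
}

def raw_score(responses, key):
    """Count the items whose response matches the keyed answer.

    Items missing from `responses` contribute nothing, which is why an
    incomplete questionnaire cannot be scored directly by this rule.
    """
    return sum(1 for item, keyed in key.items() if responses.get(item) == keyed)
```

A respondent answering yes, yes, no on the three items gets a raw score of 3, while a questionnaire with one of those items left blank can score at most 2.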

Questions:

• Are Group B answers informative about depression?

• If so, can incomplete questionnaires be scored correctly?

• Is the space of answers structured, and if so, how?

Answering  the  latter  suggests  an  approach  to  the  former. 

•  MMPI-2  structure 

•  Manifold  learning 

MMPI-2 and the diffusion framework


• Ambient space: $x \in \mathbb{R}^{567}$, i.e. 567 dimensions
(yes/no answers → ±1).

• A set of responses $x_i$ lie on or near a low dimensional manifold
$\mathcal{M}$ in $\mathbb{R}^{567}$.


• $\mathcal{M}$ is sufficiently sampled with some density $p$ by the training
set. For a given function $g$ and a compact subset of

• Scales: functions on the answer vectors, $f_{\mathrm{diagnosis}}(x) : \mathbb{R}^{567} \to \mathbb{R}$,

given by summation of “correct answers”.

• diagnosis $\in$ {anxiety, depression, ..., hysteria}.

• The scoring function $f : \mathbb{R}^{567} \to \mathbb{R}$ is smooth on $\mathcal{M}$.

Scales as Functions on Data Points: Depression

[Figure: D scale number 5]

Elevated on One Scale

[Figure: embedding plot of subjects elevated on one scale.]

• Green: pathological

• Blue: Normal

Age Scale Not Informative

• When a function maps regularly onto data points, it can be
extended to new data points.

• Fill in missing data.

• Check consistency of data.

• Find “outliers”.
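
The idea of extending a regular function to new data points can be illustrated with a simple kernel-weighted average. This is a swapped-in stand-in (a Nadaraya-Watson smoother), not the geometric-harmonics extension that the report's title refers to; the function name `extend` and the scale `sigma` are assumptions.

```python
import numpy as np

def extend(train_coords, train_values, new_coords, sigma):
    """Kernel-weighted (Nadaraya-Watson) extension of a function to new points.

    train_coords: (n, m) embedded training points; train_values: (n,) function values;
    new_coords: (q, m) points where the function is to be estimated.
    """
    diff = new_coords[:, None, :] - train_coords[None, :, :]
    # Gaussian weights: nearby training points dominate the estimate.
    w = np.exp(-np.sum(diff**2, axis=2) / sigma**2)
    return (w @ train_values) / w.sum(axis=1)
```

The same estimate supports the three uses listed above: it supplies a value where a score is missing, and a large gap between an observed score and its extended value flags an inconsistent record or an outlier. The estimate is only meaningful when `sigma` is large enough that each new point has training neighbors with non-negligible weight.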

Sensor Integration for Segmentation-1

[Figure panels: Shi/Malik / intensity; RGB; Combined]

Sensor Integration for Segmentation-2

[Figure panels: texture; RGB; Combined]

Stereo Correspondence Problem

Recover  the  depth  information  from  the  disparity  (difference  of 
image  coordinates)  of  the  corresponding  image  points. 
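
For a rectified camera pair, depth follows from disparity by the standard triangulation relation Z = f·B/d. The sketch below assumes that setup; the parameter names, and the use of pixels for focal length and metres for baseline, are assumptions and are not taken from the slides.

```python
def depth_from_disparity(disparity_px, focal_px, baseline_m):
    """Depth (metres) of a point seen by a rectified stereo pair.

    disparity_px: horizontal offset between the matched image points, in pixels.
    focal_px: focal length in pixels; baseline_m: camera separation in metres.
    """
    if disparity_px <= 0:
        raise ValueError("disparity must be positive for a point in front of the cameras")
    return focal_px * baseline_m / disparity_px
```

Depth is inversely proportional to disparity, so a fixed matching error of one pixel costs far more depth accuracy for distant points than for near ones.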

Matching Constraints in Stereo Correspondence

• Epipolar Constraint (geometric constraint).

• Ordering Constraint (heuristic constraint).

Frontal Parallel Plane Assumption in Stereo Correspondence

• Assume the surface is parallel to the image planes (i.e., at constant depth).

• Slide a window; select the position that minimizes the SSD (sum of squared differences).
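
Window matching under the frontal-parallel assumption can be sketched on a single rectified scanline, choosing the disparity whose window has the smallest sum of squared differences. The function name, the window half-width `half`, and the convention that left pixel `x` matches right pixel `x - d` are assumptions made for the illustration.

```python
import numpy as np

def ssd_disparity(left_row, right_row, x, half, max_disp):
    """Disparity at column x of a rectified scanline pair by minimum-SSD window search.

    Assumes x - half >= 0 and x + half < len(left_row).
    """
    ref = left_row[x - half:x + half + 1]
    best_d, best_ssd = 0, np.inf
    for d in range(max_disp + 1):
        lo = x - d - half
        if lo < 0:  # candidate window would leave the image
            break
        cand = right_row[lo:lo + 2 * half + 1]
        # The whole window is compared at one disparity: the constant-depth assumption.
        ssd = float(np.sum((ref - cand) ** 2))
        if ssd < best_ssd:
            best_d, best_ssd = d, ssd
    return best_d
```

Because every pixel in the window is assumed to share one disparity, this search degrades on slanted or curved surfaces, which is the limitation the later slides on tangent planes address.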




Display  (brightness  =  depth) 

[Figure panels: (our result); (belief prop); Zitnick and Kanade, PAMI 2000.]

Next Step: Use contextual information geometrically
(“directed diffusion”) in stereo correspondence

Road  map: 

•  Space  Curves  —  Frenet 

•  Smooth  Surfaces  —  Cartan 

Imposing Geometric Constraints Over Neighboring Matching Pairs

•  Build  a  local  model  (Frenet  approximation)  for  every  (possible) 
curve  point  j. 

•  Predict  the  Position  and  the  Frenet  frame  at  a  nearby  position 
i.  Compare  with  the  measurements  at  i.  They  should  agree  if  i 
comes  from  the  same  curve  as  j. 

•  Enforce  such  consistency  and  only  retain  those  that  are 
compatible  with  their  neighbors. 
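
The predict-and-compare step can be sketched with the third-order local canonical form of a space curve. The sketch makes several assumptions that go beyond the slide: curvature is treated as locally constant (its derivative is set to zero), only position is compared (the slide also compares the Frenet frame), and the names `frenet_predict`, `compatible`, and the tolerance `tol` are hypothetical.

```python
import numpy as np

def frenet_predict(x0, T, N, B, kappa, tau, s):
    """Predict the curve position at arc length s from the local model at x0.

    T, N, B: unit tangent, normal, binormal at x0; kappa, tau: curvature and torsion.
    Uses the local canonical form with the curvature derivative assumed to be zero.
    """
    return (x0
            + (s - kappa**2 * s**3 / 6.0) * T
            + (kappa * s**2 / 2.0) * N
            + (kappa * tau * s**3 / 6.0) * B)

def compatible(predicted, measured, tol):
    """True if the measurement at the neighbor agrees with the prediction within tol."""
    return bool(np.linalg.norm(predicted - measured) <= tol)
```

A candidate match at position i would be retained only if enough neighbors j predict it compatibly; the voting or relaxation rule used to enforce this in the original system is not specified in the slide.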

• Each space tangent projects to a pair of image tangents.

• Both positional disparity ($\Delta d$) and orientation disparity ($\Delta\theta$)
are used.

Results

[Figure panels: (view 2); (scale: m)]

Results

[Figure panels: (matched-L); (matched-R); (result); (scale: m)]

Stereo Correspondence for Surface Reconstruction


• Goal: Dense reconstruction of smooth surfaces.

• Observe: Tangent plane $T_p(M)$ (in solid lines) deviates from
the frontal parallel plane (in dotted lines).

Results

[Figure panels: Left (reference) image; Right image; GC+Subpixel; Our Result]

Tammy-Normals

Tammy-Zoom

Surface Normals

Zoom


Organization  of  Spectral  Information 


[Figure: reflectivity versus spectral wavelength (microns), 0 to 12.]

Fig. 1 Reflectivity of camouflage and of conifer trees and grass, showing that the
camouflage is relatively ineffective in the 4-6 µm and 10-12 µm regions (after D.
Scribner, NRL).

Munsell Color Patches

The Matlab image of page 5G of the Munsell Book of Color. RGB colors are
calculated from the AOTF-measured spectra of the Munsell colors.

Color Diffusion Map through Retinal Pigments

Color and geometry

Future Work:

• Abstract Features (in feature space)

•  Geometry  of  fusion  of  boundary/texture/color  data  for 
improved  segmentation  and  classification. 

•  Fusion  of  spectral  and  spatial  data 

•  (Fusion  of  left/right  stereo  pairs):  feedback  to  image 
segmentation  following  biological  model. 